[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

PietroGhg · 2024-09-26T09:24:33Z

Skips tests in Subgroup/barrier.cpp and Subgroup/common.cpp when the maximum sub group size doesn't divide the local size.

PietroGhg · 2024-09-26T09:29:07Z

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

steffenlarsen · 2024-09-26T14:06:01Z

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

Pennycook · 2024-09-26T14:42:20Z

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

I think @PietroGhg might be right. The description of sub-groups in SYCL 2020 is incredibly vague, unfortunately.

The line that I think is problematic is this one:

size_t SGoff = gid - lid;

...because it assumes that the linear numbering of work-items within a work-group is the same as the linear numbering of work-items within a sub-group. This is probably safe in practice and it's something we wanted to clarify (there's wording in one of the sub-group extensions, somewhere...), but I don't think it's actually guaranteed by the specification.

PietroGhg · 2024-09-26T16:49:27Z

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

I think @PietroGhg might be right. The description of sub-groups in SYCL 2020 is incredibly vague, unfortunately.

The line that I think is problematic is this one:
size_t SGoff = gid - lid;
...because it assumes that the linear numbering of work-items within a work-group is the same as the linear numbering of work-items within a sub-group. This is probably safe in practice and it's something we wanted to clarify (there's wording in one of the sub-group extensions, somewhere...), but I don't think it's actually guaranteed by the specification.

My point is not only about linearity, but also about how work items are divided into subgroups in case the max_subgroup_size doesn't divide the local size: in common.cpp we have for (int j = 0; j < G; j++) and group_id = j % L / sg_size;, which I think assumes that "left over" work items are grouped together in the same subgroup, but to me it also looks reasonable to handle those work item differently e.g. having each of them in his own subgroup (with local_range = 1 and sub_group_id = 0)

steffenlarsen · 2024-09-27T04:59:52Z

I agree on the point of linearity. This is something we falsely assumed in the CTS as well and had to change. However, I still believe the "left-over" group should have items with IDs within the local range it returns. I.e. say we have a launch with a local size of 12 and a sub-group size of 8, then I argue that the specification says that a valid set of IDs for the groups would be:

Sub-group 0: {0,2,1,3,4,6,5,7}
Sub-group 1: {0,3,1,2}

Or some shuffling of these. However, I do not believe it is right to have:

Sub-group 0: {0,2,1,3,4,6,5,7}
Sub-group 1: {0,3,6,7}

In sub-group 1 there are only 4 items, so even if the max sub-group size is 8, from SYCL's (and the user's) perspective there are only 4 items in that group.

As such, the tests should not skip just because the last group might be smaller. We should be able to get enough information from the kernels to do the checks based on the above and just not do anything outside the range of the last group.

Pennycook · 2024-09-27T13:34:01Z

My point is not only about linearity, but also about how work items are divided into subgroups in case the max_subgroup_size doesn't divide the local size: in common.cpp we have for (int j = 0; j < G; j++) and group_id = j % L / sg_size;, which I think assumes that "left over" work items are grouped together in the same subgroup, but to me it also looks reasonable to handle those work item differently e.g. having each of them in his own subgroup (with local_range = 1 and sub_group_id = 0)

Oh, I see. Sorry, I only looked at the changes for the specific test rather than the changes to common. You are correct that there is no guarantee all the "left over" work-items will be in the same sub-group.

However, I still believe the "left-over" group should have items with IDs within the local range it returns

This is true. The issue is that you can't assume you will only get two sub-groups in this case. An implementation is free to put all the work-items in one remainder sub-group, create a bunch of sub-groups of size 1, or even create a few sub-groups of different sizes. The only restriction is that each sub-group must be <= the requested maximum sub-group size.

We really need to fix this. 😅

steffenlarsen · 2024-09-27T14:07:14Z

Ah, I see! Either way, it should be possible to rewrite these tests to work around this by saving each of their local sizes and base the checks on that, right?

However, if we think the specification should be stricter, I would argue that a first step would be to guarantee that our backends follow the stricter pattern. Since these tests are for our implementation, we could make the stronger assumptions based on that for now.

Pennycook · 2024-09-27T14:08:20Z

The relevant SPIR-V wording is being discussed over on #11301. @steffenlarsen, @PietroGhg, if you could take a look at that and give feedback it would be very helpful. Once the SPIR-V extension is defined, we can expose equivalent SYCL features.

PietroGhg · 2024-10-04T15:10:41Z

Thanks @Pennycook, the key point I was missing here is that the SYCL spec doesn't mandate that there can be only one subgroup with size < max_subgroup_size (which is kinda what the tests seem to assume), but this is mandated by the OpenCL spec, and so the test passes using OpenCL. We see a failure on Native CPU because we have multiple subgroups of size 1. I'm not sure the SPIR-V extension you linked covers this. In general I'm fine with closing this PR if there are short-medium terms plans on changing the SYCL spec to match the OpenCL behaviour (and we'll adapt Native CPU to match what the spec says), otherwise I think we should refactor the tests to account for having multiple subgroups with size < max_subgroup_size. What do you think?

Pennycook · 2024-10-07T07:40:35Z

I'm not sure the SPIR-V extension you linked covers this.

It does; it says:

With this mapping, the SubgroupLocalInvocationId is equal to the inner-most dimension of the workgroup local ID modulo the SubgroupMaxSize (or, that get_sub_group_local_id() equals get_local_id(0) % get_max_sub_group_size()).

It's a little indirect, but this modulo behavior would prevent using multiple sub-groups of size 1 to handle the case where things do not divide nicely.

In general I'm fine with closing this PR if there are short-medium terms plans on changing the SYCL spec to match the OpenCL behaviour (and we'll adapt Native CPU to match what the spec says), otherwise I think we should refactor the tests to account for having multiple subgroups with size < max_subgroup_size. What do you think?

This is a tricky one, and I think it really comes down to how much work you think it would be to refactor the tests... I think I'm leaning towards making sure that the Naive CPU device can pass the tests as written, though, since most users of that backend will probably want the behavior to match GPU devices.

Note that it would still be useful for the Native CPU device to support the mode where it generates sub-groups of size 1. Even with the SPIR-V extension, we're planning to provide a mode for developers to declare that they don't need to know the sub-group mapping, and SYCL parallel_for doesn't expose sub-groups to the user. In both of those situations, generating a sub-group of size 1 would be allowed.

PietroGhg · 2024-10-09T07:29:09Z

Thanks @Pennycook @steffenlarsen, I understand subgroups better now. I'll close this PR and have a discussion with the rest of my team about the best way to proceed for Native CPU.

Skip subgroup test when subgroup size doesn't divide local size

3af93f3

PietroGhg requested a review from a team as a code owner September 26, 2024 09:24

PietroGhg requested a review from maarquitos14 September 26, 2024 09:24

PietroGhg temporarily deployed to WindowsCILock September 26, 2024 09:25 — with GitHub Actions Inactive

PietroGhg had a problem deploying to WindowsCILock September 26, 2024 09:45 — with GitHub Actions Failure

PietroGhg closed this Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

PietroGhg commented Sep 26, 2024

PietroGhg commented Sep 26, 2024

steffenlarsen commented Sep 26, 2024

Pennycook commented Sep 26, 2024

PietroGhg commented Sep 26, 2024

steffenlarsen commented Sep 27, 2024 •

edited

Loading

Pennycook commented Sep 27, 2024

steffenlarsen commented Sep 27, 2024

Pennycook commented Sep 27, 2024

PietroGhg commented Oct 4, 2024

Pennycook commented Oct 7, 2024

PietroGhg commented Oct 9, 2024

[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

Conversation

PietroGhg commented Sep 26, 2024

PietroGhg commented Sep 26, 2024

steffenlarsen commented Sep 26, 2024

Pennycook commented Sep 26, 2024

PietroGhg commented Sep 26, 2024

steffenlarsen commented Sep 27, 2024 • edited Loading

Pennycook commented Sep 27, 2024

steffenlarsen commented Sep 27, 2024

Pennycook commented Sep 27, 2024

PietroGhg commented Oct 4, 2024

Pennycook commented Oct 7, 2024

PietroGhg commented Oct 9, 2024

steffenlarsen commented Sep 27, 2024 •

edited

Loading