Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Skip subgroup tests when subgroup size doesn't divide local size #15520

Closed
wants to merge 1 commit into from

Conversation

PietroGhg
Copy link
Contributor

Skips tests in Subgroup/barrier.cpp and Subgroup/common.cpp when the maximum sub group size doesn't divide the local size.

@PietroGhg PietroGhg requested a review from a team as a code owner September 26, 2024 09:24
@PietroGhg
Copy link
Contributor Author

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

@steffenlarsen
Copy link
Contributor

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

@Pennycook
Copy link
Contributor

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

I think @PietroGhg might be right. The description of sub-groups in SYCL 2020 is incredibly vague, unfortunately.

The line that I think is problematic is this one:

size_t SGoff = gid - lid;

...because it assumes that the linear numbering of work-items within a work-group is the same as the linear numbering of work-items within a sub-group. This is probably safe in practice and it's something we wanted to clarify (there's wording in one of the sub-group extensions, somewhere...), but I don't think it's actually guaranteed by the specification.

@PietroGhg
Copy link
Contributor Author

I believe that the tests assume that the subgroup id for items in the "peeling" subgroups is increased sequentially, but it may not be the case depending on the backend. E.g. in common.cpp with a local size of 60 and a max subgroup size of 8, there will be 4 remaining elements, and the assumes that their subgroup ids will be 0,1,2,3 (all in the same subgroup) but I think that having 4 subgroups (so the ids would be 0,0,0,0) is also valid. I may be wrong here so feedback is welcome, thanks :)

I don't think that is right. I don't believe the ordering of the items are guaranteed, but get_local_id() must be less than get_local_range() as that range is the size of the group and the id represents the given items position therein. @gmlueck & @Pennycook for comment.

I think @PietroGhg might be right. The description of sub-groups in SYCL 2020 is incredibly vague, unfortunately.

The line that I think is problematic is this one:

size_t SGoff = gid - lid;

...because it assumes that the linear numbering of work-items within a work-group is the same as the linear numbering of work-items within a sub-group. This is probably safe in practice and it's something we wanted to clarify (there's wording in one of the sub-group extensions, somewhere...), but I don't think it's actually guaranteed by the specification.

My point is not only about linearity, but also about how work items are divided into subgroups in case the max_subgroup_size doesn't divide the local size: in common.cpp we have for (int j = 0; j < G; j++) and group_id = j % L / sg_size;, which I think assumes that "left over" work items are grouped together in the same subgroup, but to me it also looks reasonable to handle those work item differently e.g. having each of them in his own subgroup (with local_range = 1 and sub_group_id = 0)

@steffenlarsen
Copy link
Contributor

steffenlarsen commented Sep 27, 2024

I agree on the point of linearity. This is something we falsely assumed in the CTS as well and had to change. However, I still believe the "left-over" group should have items with IDs within the local range it returns. I.e. say we have a launch with a local size of 12 and a sub-group size of 8, then I argue that the specification says that a valid set of IDs for the groups would be:

Sub-group 0: {0,2,1,3,4,6,5,7}
Sub-group 1: {0,3,1,2}

Or some shuffling of these. However, I do not believe it is right to have:

Sub-group 0: {0,2,1,3,4,6,5,7}
Sub-group 1: {0,3,6,7}

In sub-group 1 there are only 4 items, so even if the max sub-group size is 8, from SYCL's (and the user's) perspective there are only 4 items in that group.

As such, the tests should not skip just because the last group might be smaller. We should be able to get enough information from the kernels to do the checks based on the above and just not do anything outside the range of the last group.

@Pennycook
Copy link
Contributor

My point is not only about linearity, but also about how work items are divided into subgroups in case the max_subgroup_size doesn't divide the local size: in common.cpp we have for (int j = 0; j < G; j++) and group_id = j % L / sg_size;, which I think assumes that "left over" work items are grouped together in the same subgroup, but to me it also looks reasonable to handle those work item differently e.g. having each of them in his own subgroup (with local_range = 1 and sub_group_id = 0)

Oh, I see. Sorry, I only looked at the changes for the specific test rather than the changes to common. You are correct that there is no guarantee all the "left over" work-items will be in the same sub-group.

However, I still believe the "left-over" group should have items with IDs within the local range it returns

This is true. The issue is that you can't assume you will only get two sub-groups in this case. An implementation is free to put all the work-items in one remainder sub-group, create a bunch of sub-groups of size 1, or even create a few sub-groups of different sizes. The only restriction is that each sub-group must be <= the requested maximum sub-group size.

We really need to fix this. 😅

@steffenlarsen
Copy link
Contributor

Ah, I see! Either way, it should be possible to rewrite these tests to work around this by saving each of their local sizes and base the checks on that, right?

However, if we think the specification should be stricter, I would argue that a first step would be to guarantee that our backends follow the stricter pattern. Since these tests are for our implementation, we could make the stronger assumptions based on that for now.

@Pennycook
Copy link
Contributor

The relevant SPIR-V wording is being discussed over on #11301. @steffenlarsen, @PietroGhg, if you could take a look at that and give feedback it would be very helpful. Once the SPIR-V extension is defined, we can expose equivalent SYCL features.

@PietroGhg
Copy link
Contributor Author

Thanks @Pennycook, the key point I was missing here is that the SYCL spec doesn't mandate that there can be only one subgroup with size < max_subgroup_size (which is kinda what the tests seem to assume), but this is mandated by the OpenCL spec, and so the test passes using OpenCL. We see a failure on Native CPU because we have multiple subgroups of size 1. I'm not sure the SPIR-V extension you linked covers this. In general I'm fine with closing this PR if there are short-medium terms plans on changing the SYCL spec to match the OpenCL behaviour (and we'll adapt Native CPU to match what the spec says), otherwise I think we should refactor the tests to account for having multiple subgroups with size < max_subgroup_size. What do you think?

@Pennycook
Copy link
Contributor

I'm not sure the SPIR-V extension you linked covers this.

It does; it says:

With this mapping, the SubgroupLocalInvocationId is equal to the inner-most dimension of the workgroup local ID modulo the SubgroupMaxSize (or, that get_sub_group_local_id() equals get_local_id(0) % get_max_sub_group_size()).

It's a little indirect, but this modulo behavior would prevent using multiple sub-groups of size 1 to handle the case where things do not divide nicely.

In general I'm fine with closing this PR if there are short-medium terms plans on changing the SYCL spec to match the OpenCL behaviour (and we'll adapt Native CPU to match what the spec says), otherwise I think we should refactor the tests to account for having multiple subgroups with size < max_subgroup_size. What do you think?

This is a tricky one, and I think it really comes down to how much work you think it would be to refactor the tests... I think I'm leaning towards making sure that the Naive CPU device can pass the tests as written, though, since most users of that backend will probably want the behavior to match GPU devices.

Note that it would still be useful for the Native CPU device to support the mode where it generates sub-groups of size 1. Even with the SPIR-V extension, we're planning to provide a mode for developers to declare that they don't need to know the sub-group mapping, and SYCL parallel_for doesn't expose sub-groups to the user. In both of those situations, generating a sub-group of size 1 would be allowed.

@PietroGhg
Copy link
Contributor Author

Thanks @Pennycook @steffenlarsen, I understand subgroups better now. I'll close this PR and have a discussion with the rest of my team about the best way to proceed for Native CPU.

@PietroGhg PietroGhg closed this Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants