Skip to content

[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: sycl
Choose a base branch
from

Conversation

DBDuncan
Copy link
Contributor

@DBDuncan DBDuncan commented Jun 6, 2025

sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.

sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.
@DBDuncan DBDuncan requested a review from a team as a code owner June 6, 2025 15:29
@DBDuncan DBDuncan requested a review from aelovikov-intel June 6, 2025 15:29
Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the throw from these SYCL classes instead?

@@ -3154,13 +3162,11 @@ _ZN4sycl3_V15queue20wait_and_throw_proxyERKNS0_6detail13code_locationE
_ZN4sycl3_V15queue22memcpyFromDeviceGlobalEPvPKvbmmRKSt6vectorINS0_5eventESaIS6_EE
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_14SubmissionInfoERKNS2_13code_locationEb
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb
_ZNK4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it wasn't you who messed up the sorting here, but please either remove unnecessary changes or clean it up with a preceding PR to just restore the sorting.

Comment on lines 73 to 82
NDRDescT(sycl::range<Dims_> N, bool SetNumWorkGroups) : Dims{size_t(Dims_)} {
if (SetNumWorkGroups) {
for (size_t I = 0; I < Dims_; ++I) {
NumWorkGroups[I] = N[I];
}
} else {
for (size_t I = 0; I < Dims_; ++I) {
GlobalSize[I] = N[I];
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really weird to me. I know you didn't introduce this SetNumWorkGroups thing, but it's odd.

From a quick glance, it looks like:

  • We always store the range passed to the constructor, but potentially in different places.
  • NumWorkGroups is only used by hierarchical parallelism (parallel_for_work_group, specifically).

Could we flip the logic here, so that the constructor always unconditionally stores into GlobalSize, and the parallel_for_work_group code knows to read GlobalSize instead of NumWorkGroups?

Copy link
Contributor Author

@DBDuncan DBDuncan Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been looking into this. There is some confusing stuff going on in the unmodified version of handler.cpp:

llvm/sycl/source/handler.cpp

Lines 1037 to 1079 in 22c8d2f

case kernel_param_kind_t::kind_stream: {
// Stream contains several accessors inside.
stream *S = static_cast<stream *>(Ptr);
detail::AccessorBaseHost *GBufBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalBuf);
detail::AccessorImplPtr GBufImpl = detail::getSyclObjImpl(*GBufBase);
detail::Requirement *GBufReq = GBufImpl.get();
addArgsForGlobalAccessor(
GBufReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
++IndexShift;
detail::AccessorBaseHost *GOffsetBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalOffset);
detail::AccessorImplPtr GOfssetImpl = detail::getSyclObjImpl(*GOffsetBase);
detail::Requirement *GOffsetReq = GOfssetImpl.get();
addArgsForGlobalAccessor(
GOffsetReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
++IndexShift;
detail::AccessorBaseHost *GFlushBase =
static_cast<detail::AccessorBaseHost *>(&S->GlobalFlushBuf);
detail::AccessorImplPtr GFlushImpl = detail::getSyclObjImpl(*GFlushBase);
detail::Requirement *GFlushReq = GFlushImpl.get();
size_t GlobalSize = impl->MNDRDesc.GlobalSize.size();
// If work group size wasn't set explicitly then it must be recieved
// from kernel attribute or set to default values.
// For now we can't get this attribute here.
// So we just suppose that WG size is always default for stream.
// TODO adjust MNDRDesc when device image contains kernel's attribute
if (GlobalSize == 0) {
// Suppose that work group size is 1 for every dimension
GlobalSize = impl->MNDRDesc.NumWorkGroups.size();
}
addArgsForGlobalAccessor(GFlushReq, Index, IndexShift, Size,
IsKernelCreatedFromSource, GlobalSize, impl->MArgs,
IsESIMD);
++IndexShift;
addArg(kernel_param_kind_t::kind_std_layout, &S->FlushBufferSize,
sizeof(S->FlushBufferSize), Index + IndexShift);
break;

It looks like to me that it is expected that it can be the case the GlobalSize is zero. This means that addArgsForGlobalAccessor are called with the size argument set to zero and then later on GlobalSize is checked if it is zero and it it is it is set to the size of NumWorkGroups.

I am not quite sure how this is working. I would have expected there to be issues passing size of zero to addArgsForGlobalAccessor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other place that NumWorkGroups is used is in adjectNDRangePerKernel in sycl/source/detail/scheduler/commands.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because AccImpl->PerWI just happens to be false so GlobalSize is not used. Not sure what that variable is meant to signal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try using a single variable, and see if anything breaks?

If the other parts of the code are checking for zero GlobalSize and then reading NumWorkGroups instead, it seems like you could just remove the check and read GlobalSize unconditionally,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem to break things. Likely what is happening is if GlobalSize is all zeros then it is implicitly implied that GlobalSize needs to be modifed. Such as in adjustNDRangePerKernel where GlobalSize is checked if zero and if so, GlobalSize is set to work group size * NumWorkGroups.

A lot of very annoying side effects going on.

Copy link
Contributor Author

@DBDuncan DBDuncan Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also this comment near the bottom of this class:

  /// Number of workgroups, used to record the number of workgroups from the
  /// simplest form of parallel_for_work_group. If set, all other fields must be
  /// zero
  std::array<size_t, 3> NumWorkGroups{0, 0, 0};
  std::array<size_t, 3> ClusterDimensions{1, 1, 1};

DBDuncan added 2 commits June 11, 2025 15:58
… extra dimensions to zero or one respectively weather LocalSizes is zero or not respectively
}

for (int I = Dims_; I < 3; ++I) {
LocalSize[I] = LocalSizes[0] ? 1 : 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of tests that depend on extra LocalSize dimensions higher than Dims_ being set to zero or one depending on whether LocalSizes[I] is zero or not respectively. RequiredWGSize.NoRequiredSize and RequiredWGSize.HasRequiredSize always fail if extra LocalSize dimensions are always set to 1 and various tests such as work_group_size_prop.cpp and six others fail if extra LocalSize dimensions are always set to zero. This preserves the old behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to me that this was introduced in the first place. It really should not matter what the value of dimensions higher than Dims_ are and should just be ignored. But now a number of tests depend on this behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO to revisit this?

This sort of complexity will have a (small) impact on runtime, but it's also going to make it harder to make changes to NDRDescT later on. Making sure NDRDescT returns values we can't explain just to satisfy existing tests is one way to proceed -- but we could also look into whether those tests are actually useful, or rewrite them (and related functionality) to do the right thing.

… by setting extra dimension values to zero when using spercific constructor
@DBDuncan
Copy link
Contributor Author

DBDuncan commented Jun 13, 2025

Working on improving performance with this PR has lead me to hopefully make it more explicit in what values the members of NDRDescT are set to when constructed. There are a lot of tests that rely on 1 or 0 to be set for dimensions that are not used and other behaviour I have had to preserve to get the CI to pass.

Not looking to fix the root cause of that here but to at least hopefully make it more obvious what is going on.

@DBDuncan
Copy link
Contributor Author

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

@aelovikov-intel
Copy link
Contributor

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

@DBDuncan
Copy link
Contributor Author

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throw's instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc

I do not see any requirement to throw so asserts should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants