-
Notifications
You must be signed in to change notification settings - Fork 772
[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the throw
from these SYCL classes instead?
@@ -3154,13 +3162,11 @@ _ZN4sycl3_V15queue20wait_and_throw_proxyERKNS0_6detail13code_locationE | |||
_ZN4sycl3_V15queue22memcpyFromDeviceGlobalEPvPKvbmmRKSt6vectorINS0_5eventESaIS6_EE | |||
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_14SubmissionInfoERKNS2_13code_locationEb | |||
_ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb | |||
_ZNK4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it wasn't you who messed up the sorting here, but please either remove unnecessary changes or clean it up with a preceding PR to just restore the sorting.
sycl/source/detail/cg.hpp
Outdated
NDRDescT(sycl::range<Dims_> N, bool SetNumWorkGroups) : Dims{size_t(Dims_)} { | ||
if (SetNumWorkGroups) { | ||
for (size_t I = 0; I < Dims_; ++I) { | ||
NumWorkGroups[I] = N[I]; | ||
} | ||
} else { | ||
for (size_t I = 0; I < Dims_; ++I) { | ||
GlobalSize[I] = N[I]; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really weird to me. I know you didn't introduce this SetNumWorkGroups
thing, but it's odd.
From a quick glance, it looks like:
- We always store the
range
passed to the constructor, but potentially in different places. NumWorkGroups
is only used by hierarchical parallelism (parallel_for_work_group
, specifically).
Could we flip the logic here, so that the constructor always unconditionally stores into GlobalSize
, and the parallel_for_work_group
code knows to read GlobalSize
instead of NumWorkGroups
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been looking into this. There is some confusing stuff going on in the unmodified version of handler.cpp:
Lines 1037 to 1079 in 22c8d2f
case kernel_param_kind_t::kind_stream: { | |
// Stream contains several accessors inside. | |
stream *S = static_cast<stream *>(Ptr); | |
detail::AccessorBaseHost *GBufBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalBuf); | |
detail::AccessorImplPtr GBufImpl = detail::getSyclObjImpl(*GBufBase); | |
detail::Requirement *GBufReq = GBufImpl.get(); | |
addArgsForGlobalAccessor( | |
GBufReq, Index, IndexShift, Size, IsKernelCreatedFromSource, | |
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD); | |
++IndexShift; | |
detail::AccessorBaseHost *GOffsetBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalOffset); | |
detail::AccessorImplPtr GOfssetImpl = detail::getSyclObjImpl(*GOffsetBase); | |
detail::Requirement *GOffsetReq = GOfssetImpl.get(); | |
addArgsForGlobalAccessor( | |
GOffsetReq, Index, IndexShift, Size, IsKernelCreatedFromSource, | |
impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD); | |
++IndexShift; | |
detail::AccessorBaseHost *GFlushBase = | |
static_cast<detail::AccessorBaseHost *>(&S->GlobalFlushBuf); | |
detail::AccessorImplPtr GFlushImpl = detail::getSyclObjImpl(*GFlushBase); | |
detail::Requirement *GFlushReq = GFlushImpl.get(); | |
size_t GlobalSize = impl->MNDRDesc.GlobalSize.size(); | |
// If work group size wasn't set explicitly then it must be recieved | |
// from kernel attribute or set to default values. | |
// For now we can't get this attribute here. | |
// So we just suppose that WG size is always default for stream. | |
// TODO adjust MNDRDesc when device image contains kernel's attribute | |
if (GlobalSize == 0) { | |
// Suppose that work group size is 1 for every dimension | |
GlobalSize = impl->MNDRDesc.NumWorkGroups.size(); | |
} | |
addArgsForGlobalAccessor(GFlushReq, Index, IndexShift, Size, | |
IsKernelCreatedFromSource, GlobalSize, impl->MArgs, | |
IsESIMD); | |
++IndexShift; | |
addArg(kernel_param_kind_t::kind_std_layout, &S->FlushBufferSize, | |
sizeof(S->FlushBufferSize), Index + IndexShift); | |
break; |
It looks like to me that it is expected that it can be the case the GlobalSize
is zero. This means that addArgsForGlobalAccessor
are called with the size argument set to zero and then later on GlobalSize
is checked if it is zero and it it is it is set to the size of NumWorkGroups
.
I am not quite sure how this is working. I would have expected there to be issues passing size of zero to addArgsForGlobalAccessor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only other place that NumWorkGroups
is used is in adjectNDRangePerKernel
in sycl/source/detail/scheduler/commands.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because AccImpl->PerWI
just happens to be false so GlobalSize
is not used. Not sure what that variable is meant to signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try using a single variable, and see if anything breaks?
If the other parts of the code are checking for zero GlobalSize
and then reading NumWorkGroups
instead, it seems like you could just remove the check and read GlobalSize
unconditionally,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem to break things. Likely what is happening is if GlobalSize
is all zeros then it is implicitly implied that GlobalSize
needs to be modifed. Such as in adjustNDRangePerKernel
where GlobalSize
is checked if zero and if so, GlobalSize
is set to work group size * NumWorkGroups
.
A lot of very annoying side effects going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also this comment near the bottom of this class:
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero
std::array<size_t, 3> NumWorkGroups{0, 0, 0};
std::array<size_t, 3> ClusterDimensions{1, 1, 1};
… extra dimensions to zero or one respectively weather LocalSizes is zero or not respectively
} | ||
|
||
for (int I = Dims_; I < 3; ++I) { | ||
LocalSize[I] = LocalSizes[0] ? 1 : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a number of tests that depend on extra LocalSize dimensions higher than Dims_ being set to zero or one depending on whether LocalSizes[I] is zero or not respectively. RequiredWGSize.NoRequiredSize and RequiredWGSize.HasRequiredSize always fail if extra LocalSize dimensions are always set to 1 and various tests such as work_group_size_prop.cpp
and six others fail if extra LocalSize dimensions are always set to zero. This preserves the old behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems strange to me that this was introduced in the first place. It really should not matter what the value of dimensions higher than Dims_ are and should just be ignored. But now a number of tests depend on this behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a TODO to revisit this?
This sort of complexity will have a (small) impact on runtime, but it's also going to make it harder to make changes to NDRDescT
later on. Making sure NDRDescT
returns values we can't explain just to satisfy existing tests is one way to proceed -- but we could also look into whether those tests are actually useful, or rewrite them (and related functionality) to do the right thing.
… by setting extra dimension values to zero when using spercific constructor
…onstructor is used in NDRDescT
Working on improving performance with this PR has lead me to hopefully make it more explicit in what values the members of Not looking to fix the root cause of that here but to at least hopefully make it more obvious what is going on. |
Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the |
Change them to |
Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc I do not see any requirement to throw so asserts should be fine. |
sycl::range and sycl::id perform validity checks every time setting them. Use std::array instead as dimensions should already be valid. In addition, remove explicitly padding dimensions smaller than 3 and get number of dimensions from template argument instead of function argument.