[SYCL] Optimize NDRDescT by removing sycl::range, sycl::id and padding #18851

DBDuncan · 2025-06-06T15:29:01Z

sycl::range and sycl::id perform validity checks every time they are set. Use std::array instead, as the dimensions should already be valid. In addition, remove the explicit padding of dimensions smaller than 3, and get the number of dimensions from a template argument instead of a function argument.
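As a rough illustration of the change described above (not the actual NDRDescT from cg.hpp), the sketch below stores sizes in a std::array and takes the dimension count from a template parameter. `NDRDescSketch` and its members are hypothetical names chosen for this example, and the zero default for unused entries is an assumption.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for NDRDescT; names and defaults are assumptions.
struct NDRDescSketch {
  std::array<std::size_t, 3> GlobalSize{0, 0, 0};
  std::size_t Dims = 0;

  // The dimension count is deduced from the argument type, so no runtime
  // "dims" parameter is needed and only the first Dims_ entries are written.
  template <std::size_t Dims_>
  explicit NDRDescSketch(const std::array<std::size_t, Dims_> &N)
      : Dims{Dims_} {
    static_assert(Dims_ >= 1 && Dims_ <= 3, "1 to 3 dimensions expected");
    for (std::size_t I = 0; I < Dims_; ++I)
      GlobalSize[I] = N[I]; // plain store, no validity check on this path
  }
};
```

Constructing `NDRDescSketch` from a two-element array writes two entries and leaves the third at its default, without any per-element check.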

aelovikov-intel

Can we remove the throw from these SYCL classes instead?

aelovikov-intel · 2025-06-06T15:46:57Z

sycl/test/abi/sycl_symbols_linux.dump

@@ -3154,13 +3162,11 @@ _ZN4sycl3_V15queue20wait_and_throw_proxyERKNS0_6detail13code_locationE
 _ZN4sycl3_V15queue22memcpyFromDeviceGlobalEPvPKvbmmRKSt6vectorINS0_5eventESaIS6_EE
 _ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_14SubmissionInfoERKNS2_13code_locationEb
 _ZN4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb
-_ZNK4sycl3_V15queue22submit_with_event_implERKNS0_6detail19type_erased_cgfo_tyERKNS2_2v114SubmissionInfoERKNS2_13code_locationEb


I know it wasn't you who messed up the sorting here, but please either remove unnecessary changes or clean it up with a preceding PR to just restore the sorting.

Pennycook · 2025-06-09T14:15:16Z

sycl/source/detail/cg.hpp

+  NDRDescT(sycl::range<Dims_> N, bool SetNumWorkGroups) : Dims{size_t(Dims_)} {
+    if (SetNumWorkGroups) {
+      for (size_t I = 0; I < Dims_; ++I) {
+        NumWorkGroups[I] = N[I];
+      }
+    } else {
+      for (size_t I = 0; I < Dims_; ++I) {
+        GlobalSize[I] = N[I];
+      }
+    }


This looks really weird to me. I know you didn't introduce this SetNumWorkGroups thing, but it's odd.

From a quick glance, it looks like:

We always store the range passed to the constructor, but potentially in different places.

NumWorkGroups is only used by hierarchical parallelism (parallel_for_work_group, specifically).

Could we flip the logic here, so that the constructor always unconditionally stores into GlobalSize, and the parallel_for_work_group code knows to read GlobalSize instead of NumWorkGroups?

I have been looking into this. There is some confusing stuff going on in the unmodified version of handler.cpp:

llvm/sycl/source/handler.cpp

Lines 1037 to 1079 in 22c8d2f

case kernel_param_kind_t::kind_stream: {
  // Stream contains several accessors inside.
  stream *S = static_cast<stream *>(Ptr);

  detail::AccessorBaseHost *GBufBase =
      static_cast<detail::AccessorBaseHost *>(&S->GlobalBuf);
  detail::AccessorImplPtr GBufImpl = detail::getSyclObjImpl(*GBufBase);
  detail::Requirement *GBufReq = GBufImpl.get();
  addArgsForGlobalAccessor(
      GBufReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
      impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
  ++IndexShift;
  detail::AccessorBaseHost *GOffsetBase =
      static_cast<detail::AccessorBaseHost *>(&S->GlobalOffset);
  detail::AccessorImplPtr GOfssetImpl = detail::getSyclObjImpl(*GOffsetBase);
  detail::Requirement *GOffsetReq = GOfssetImpl.get();
  addArgsForGlobalAccessor(
      GOffsetReq, Index, IndexShift, Size, IsKernelCreatedFromSource,
      impl->MNDRDesc.GlobalSize.size(), impl->MArgs, IsESIMD);
  ++IndexShift;
  detail::AccessorBaseHost *GFlushBase =
      static_cast<detail::AccessorBaseHost *>(&S->GlobalFlushBuf);
  detail::AccessorImplPtr GFlushImpl = detail::getSyclObjImpl(*GFlushBase);
  detail::Requirement *GFlushReq = GFlushImpl.get();

  size_t GlobalSize = impl->MNDRDesc.GlobalSize.size();
  // If work group size wasn't set explicitly then it must be recieved
  // from kernel attribute or set to default values.
  // For now we can't get this attribute here.
  // So we just suppose that WG size is always default for stream.
  // TODO adjust MNDRDesc when device image contains kernel's attribute
  if (GlobalSize == 0) {
    // Suppose that work group size is 1 for every dimension
    GlobalSize = impl->MNDRDesc.NumWorkGroups.size();
  }
  addArgsForGlobalAccessor(GFlushReq, Index, IndexShift, Size,
                           IsKernelCreatedFromSource, GlobalSize, impl->MArgs,
                           IsESIMD);
  ++IndexShift;
  addArg(kernel_param_kind_t::kind_std_layout, &S->FlushBufferSize,
         sizeof(S->FlushBufferSize), Index + IndexShift);

  break;

It looks to me like GlobalSize is expected to be zero in some cases. That means addArgsForGlobalAccessor is called with the size argument set to zero, and only later is GlobalSize checked for zero and, if it is zero, set to the size of NumWorkGroups.

I am not quite sure how this is working. I would have expected there to be issues passing a size of zero to addArgsForGlobalAccessor.

The only other place that NumWorkGroups is used is in adjustNDRangePerKernel in sycl/source/detail/scheduler/commands.cpp

I think it is because AccImpl->PerWI just happens to be false so GlobalSize is not used. Not sure what that variable is meant to signal.

Can you try using a single variable, and see if anything breaks?

If the other parts of the code are checking for zero GlobalSize and then reading NumWorkGroups instead, it seems like you could just remove the check and read GlobalSize unconditionally.

It does seem to break things. Likely what is happening is that an all-zero GlobalSize implicitly signals that GlobalSize still needs to be modified. For example, adjustNDRangePerKernel checks whether GlobalSize is zero and, if so, sets GlobalSize to work group size * NumWorkGroups.
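The implicit convention described above might be sketched as follows. This is only an illustration under assumptions, not the code in commands.cpp: `DescSketch` and `adjustSketch` are hypothetical names, and the real adjustNDRangePerKernel may choose the work-group size and handle edge cases differently.

```cpp
#include <array>
#include <cstddef>

// Hypothetical descriptor holding only the fields discussed in this thread.
struct DescSketch {
  std::array<std::size_t, 3> GlobalSize{0, 0, 0};
  std::array<std::size_t, 3> LocalSize{0, 0, 0};
  std::array<std::size_t, 3> NumWorkGroups{0, 0, 0};
  std::size_t Dims = 0;
};

// An all-zero GlobalSize is treated as "only NumWorkGroups was given", so
// GlobalSize is derived as work group size * NumWorkGroups per dimension.
inline void adjustSketch(DescSketch &D,
                         const std::array<std::size_t, 3> &WGSize) {
  bool AllZero = true;
  for (std::size_t V : D.GlobalSize)
    AllZero = AllZero && (V == 0);
  if (!AllZero)
    return; // GlobalSize was set explicitly, nothing to derive

  for (std::size_t I = 0; I < D.Dims; ++I) {
    D.LocalSize[I] = WGSize[I];
    D.GlobalSize[I] = WGSize[I] * D.NumWorkGroups[I];
  }
}
```

The sketch makes the side effect visible: whether the descriptor is rewritten depends on a sentinel value rather than on an explicit flag.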

A lot of very annoying side effects going on.

There is also this comment near the bottom of this class:

  /// Number of workgroups, used to record the number of workgroups from the
  /// simplest form of parallel_for_work_group. If set, all other fields must be
  /// zero
  std::array<size_t, 3> NumWorkGroups{0, 0, 0};
  std::array<size_t, 3> ClusterDimensions{1, 1, 1};


DBDuncan · 2025-06-11T15:01:23Z

sycl/source/detail/cg.hpp

+    }
+
+    for (int I = Dims_; I < 3; ++I) {
+      LocalSize[I] = LocalSizes[0] ? 1 : 0;


There are a number of tests that depend on the extra LocalSize dimensions above Dims_ being set to zero or one, depending on whether LocalSizes[I] is zero or not. RequiredWGSize.NoRequiredSize and RequiredWGSize.HasRequiredSize always fail if the extra LocalSize dimensions are always set to 1, and various tests such as work_group_size_prop.cpp and six others fail if they are always set to zero. This preserves the old behaviour.

It seems strange to me that this was introduced in the first place. The values of dimensions higher than Dims_ really should not matter and should just be ignored. But now a number of tests depend on this behaviour.
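For readers following along, the padding rule in the diff hunk above can be isolated in a small helper. This is a sketch only: `padLocalSize` is a hypothetical function, not part of the PR, and it mirrors the hunk in keying the fill value on the first element.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper mirroring the hunk: copy the first Dims_ entries, then
// fill the remaining entries with 1 if LocalSizes[0] is non-zero, else 0.
template <std::size_t Dims_>
std::array<std::size_t, 3>
padLocalSize(const std::array<std::size_t, Dims_> &LocalSizes) {
  static_assert(Dims_ >= 1 && Dims_ <= 3, "1 to 3 dimensions expected");
  std::array<std::size_t, 3> LocalSize{0, 0, 0};
  for (std::size_t I = 0; I < Dims_; ++I)
    LocalSize[I] = LocalSizes[I];
  for (std::size_t I = Dims_; I < 3; ++I)
    LocalSize[I] = LocalSizes[0] != 0 ? 1 : 0;
  return LocalSize;
}
```

A one-dimensional local size of 16 is padded with ones, while an unset (all-zero) local size stays all zeros, which is the distinction the tests named above appear to rely on.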

Can we add a TODO to revisit this?

This sort of complexity will have a (small) impact on runtime, but it's also going to make it harder to make changes to NDRDescT later on. Making sure NDRDescT returns values we can't explain just to satisfy existing tests is one way to proceed -- but we could also look into whether those tests are actually useful, or rewrite them (and related functionality) to do the right thing.


DBDuncan · 2025-06-13T17:52:33Z

Working on improving performance with this PR has led me to try to make it more explicit what values the members of NDRDescT are set to when constructed. A lot of tests rely on unused dimensions being set to 1 or 0, and there is other behaviour I have had to preserve to get the CI to pass.

Not looking to fix the root cause of that here but to at least hopefully make it more obvious what is going on.

DBDuncan · 2025-06-16T13:19:50Z

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throws instead of something else?

aelovikov-intel · 2025-06-16T14:33:42Z

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throws instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.


DBDuncan · 2025-06-17T15:24:10Z

Can we remove the throw from these SYCL classes instead?

Sorry, to clarify, do you mean to remove the throws as well @aelovikov-intel ? Or remove the throws instead of something else?

Change them to asserts maybe, if the spec doesn't require them. Or move up the callstack if the spec says the users of these classes must throw.

Looks to be from this extension: sycl/doc/extensions/proposed/sycl_ext_codeplay_cuda_cluster_group.asciidoc

I do not see any requirement to throw so asserts should be fine.
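A minimal sketch of what replacing a throwing check with an assert could look like. `setClusterDimsSketch` is a hypothetical helper, and the non-zero condition is an assumption chosen for illustration; the actual checks in the SYCL classes under discussion may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical setter: the validity check is a debug-only assert, so release
// builds (NDEBUG defined) pay nothing for it on the submission path.
inline void setClusterDimsSketch(std::array<std::size_t, 3> &Dst,
                                 const std::array<std::size_t, 3> &Src) {
  for (std::size_t I = 0; I < 3; ++I) {
    assert(Src[I] != 0 && "dimension is assumed to be non-zero");
    Dst[I] = Src[I];
  }
}
```

The trade-off is that invalid input is only diagnosed in debug builds, which is acceptable only if, as noted above, nothing in the spec requires an exception here.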

DBDuncan requested a review from a team as a code owner June 6, 2025 15:29

DBDuncan requested a review from aelovikov-intel June 6, 2025 15:29

Format code

520e446

aelovikov-intel reviewed Jun 6, 2025

Pennycook reviewed Jun 9, 2025

DBDuncan added 2 commits June 11, 2025 15:58

Improve modification of NDRDescT in adjustNDRangePerKernel

907717c

Fix bug when setting LocalSize by preserving old behaviour of setting extra dimensions to zero or one respectively weather LocalSizes is zero or not respectively

adafe3d

DBDuncan commented Jun 11, 2025

Format and remove mistakenly committed code

ef58ba7

DBDuncan added 2 commits June 12, 2025 13:44

Merge remote-tracking branch 'origin/sycl' into duncan/ndrange-perf-fix

86d7783

Fix issues with .size() being called on std::array when previously was called on sycl::range

7d4175f

swap int with size_t

4fe9507

Set GlobalRange default value to 1

11fdc89

Preserve previous behaviour to get HierPar/hier_par_basic.cpp to pass by setting extra dimension values to zero when using spercific constructor

19e8982

Preserve old behaviour of GlobalSize being set to zero when default constructor is used in NDRDescT

73b8e4d

Remove commented out code

889b4d7

remove setting extra global size dims to 1 when using SetNumWorkGroups constructor

9964dcf

Reintroduce setting extra global size dims to 1 only when SetNumWorkGroups is false

6c51413

