Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][DOC] Add phase 2 design for "if_device_has" #9127

Open
wants to merge 3 commits into
base: sycl
Choose a base branch
from

Conversation

gmlueck
Copy link
Contributor

@gmlueck gmlueck commented Apr 19, 2023

Add the "phase 2" design for if_device_has and if_architecture_is extensions. Whereas the phase 1 design worked only in some limited cases for AOT compilation, the phase 2 design fully supports both of these extensions in both AOT and JIT modes.

The phase 2 design is also implementable in a 1-pass compiler.

Note that the design document presents a primary design and also two alternatives. Discussion of the alternatives is encouraged in this PR.

Attention @intel/dpcpp-spirv-doc-reviewers: The SPIR-V extension specification in this PR is used only in one of the alternate designs. Although comments on the SPIR-V spec are welcome, it may make sense to decide first whether the alternate design makes sense before doing a detailed review of the SPIR-V spec.

Add the "phase 2" design for `if_device_has` and `if_architecture_is`
extensions.  Whereas the phase 1 design worked only in some limited
cases for AOT compilation, the phase 2 design fully supports both of
these extensions in both AOT and JIT modes.

The phase 2 design is also implementable in a 1-pass compiler.

Note that the design document presents a primary design and also two
alternatives.  Discussion of the alternatives is encouraged in this PR.

Attention @intel/dpcpp-spirv-doc-reviewers: The SPIR-V extension
specification in this PR is used only in one of the alternate designs.
Although comments on the SPIR-V spec are welcome, it may make sense to
decide first whether the alternate design makes sense before doing a
detailed review of the SPIR-V spec.
@gmlueck gmlueck requested review from a team as code owners April 19, 2023 21:03
Copy link
Contributor

@AerialMantis AerialMantis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design looks good to me, left a few comments for things to potentially clarify.

being a function pointer. This callee is the _Conditional Action_ function
_FAction_.

* For each call to _FCaller_, the pass adds a new parameter at the beginning of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say _FAction_ here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The pass adds a new parameter to each call to FCaller. That new parameter is the literal function name FAction.

The calls to FAction are in the body of FCaller, and this body is removed (see the next bullet point).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused by this bullet. If we pass an extra argument to FCaller which is "literal function name", then why are we updating FCaller definition to have a pointer-to-function argument?

Shouldn't arguments match between call site and the definition? Am I misunderstanding what "literal function name" means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "literal function name" isn't the right term. I mean that the call site should be changed to look like this:

call void @call_if_on_device_conditionallyXXX(@CallableXXX, ...)

where @CallableXXX is a function defined in the same LLVM IR module.

In order for this to be correct IR, the declaration of @call_if_on_device_conditionallyXXX needs to be changed so that its first argument is pointer-to-function. When I originally write this PR, I tried these IR transformation by hand, so I'm pretty sure about the parameter type.


* If the target is SPIR-V:

* It moves the definition of `@CallableXXX` and its entire call tree to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if the call tree of the callable includes functions also called from outside of the callable's call tree, would those functions need to be duplicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is described in the detailed section below titled "Changes to the sycl-post-link tool (non-AOT)".

* The call to _FCallerB_ has _Conditional Action_ named _FActionB_ and
_Condition Expression_ named _ExpB_.

When this happens, aspects used by _FActionA_ have the condition _ExpA_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the conditional expressions are combined like this would it retain separate relationships between the conditions and the aspects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "separate relationship"? The next section describes the information in the !sycl_conditionally_used_aspects metadata. Each set of aspects has an associated condition. Does that answer your question?

functions as described in [Device Code Dynamic Linking][6]. This algorithm is
extended to look also for _Add On Images_.

If the _Main Image_ contains the "SYCL/add on images" property set, the runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this handle nested add-on images, would these all be contained in the same property set, does the linking order here matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested add-on images are handled in the last bullet point below that starts "The selected Add On Image may also contain a "SYCL/add on images" property set".

The algorithm in sycl-post-link breaks nested if_device_has calls into separate add-on images. Each of those add-on images contains its own "SYCL/device requirements" property set that defines the aspects it uses.

The link order doesn't matter.

resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for
each copy of the IR.

It would be possible to use this alternate design for some AOT targets but not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this design would be preferable for Nvidia and AMD targets as there are many optimizations such as inlining which are crucial for performance which I understand would not be run across the conditional caller and conditional action boundary. I think the benefit of the additional optimizations would be worth the higher compilation times when targeting multiple AoT targets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this seems reasonable to me. This is what I had in mind as a possibility when writing this section.

being a function pointer. This callee is the _Conditional Action_ function
_FAction_.

* For each call to _FCaller_, the pass adds a new parameter at the beginning of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused by this bullet. If we pass an extra argument to FCaller which is "literal function name", then why are we updating FCaller definition to have a pointer-to-function argument?

Shouldn't arguments match between call site and the definition? Am I misunderstanding what "literal function name" means?

This IR pass is changed to perform the following additional aspect
propagations:

* Aspects used by each _Conditional Action_ function (and by functions it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that we don't have any special markup on Conditional Action? I'm asking because currently we don't really do bottom-up propagation to set, but instead we do top-bottom call graph analysis starting from kernels.

Therefore, with the current algorithm we will mostly ignore Conditional Actions. We can rewrite it to bottom-up, but the question is where to we add the metadata - to each node in a call graph? or only to top-level entities such as kernels? or to any functions which don't have callers within a module?

This is all solvable, I'm noting it here, because it would be good to avoid that to prevent unnecessary bloat of LLVM IR modules we produce by placing that metadata everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think kernel-bundle based approach would allow us to keep top-down approach with some extra handling/resets in handling of these nodes.

Comment on lines +869 to +872
* When determining whether two kernels can be placed in the same device image,
the `!used_aspects` must be the same and the
`!sycl_conditionally_used_aspects` must be the same (the same set of
conditions and the same set of conditionally used aspects).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to take !sycl_conditionally_used_aspects into account? Since actual uses of those conditionally used aspects will optionally be linked later at runtime, it should be safe to bundle two kernels which directly use the same aspects together.

Comment on lines +785 to +793
* Aspect usage can be propagated through nested _Conditional Caller_ function
calls. To illustrate, consider the following example:

* A _Conditional Caller_ named _FCallerA_ has the _Conditional Action_ named
_FActionA_ and the _Condition Expression_ named _ExpA_.
* The function _FActionA_ calls a different _Conditional Caller_ named
_FCallerB_.
* The call to _FCallerB_ has _Conditional Action_ named _FActionB_ and
_Condition Expression_ named _ExpB_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, both FActionA and FActionB will at the end reside each in their own device image with individual requirements - do we even need then to propagate conditionally used aspects through nested conditional calls?

Comment on lines +959 to +961
* Each pair of _Add On Images_ (i.e. the one with the real function
definitions and the one with the stub definitions) is assigned a unique
name. By convention this is just an integer in string form (e.g. "1").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple integer may not be enough here, because application may be linked to a shared library which also contains SYCL kernels and if_device_has calls, thus both the app and the library will provide add-on device images with the same integer string "1". We probably need to include GUID in the name as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that the "scope" of the Add On Image numbers was limited to the ELF file that contains them. For example, if a Main Image contains a reference to Add On Image "1", we only search for that Add On Image in the same ELF file.

If you think this will be difficult and we need a globally unique name, we could use a GUID. However, I think the driver will need to pass a GUID via a command line option to sycl-post-link.

contains one property for each of its _Add On Images_. The name of each
property is a unique identifier for the _Add On Image_, which by convention is
just an integer in string form (e.g "1"). The value of the property has type
`PI_PROPERTY_TYPE_BYTE_ARRAY` containing a series of `uint32` values `N1`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be a signed integer? Encoding for logical operators and "keywords" is defined as negative numbers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Changed in 01fa4f8.

properties to the iteration set, causing their _Add On Images_ to be found
also.

Once this completes, the runtime computes the union of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also think about in-memory and on-disk caches that we have. I don't remember what do we have there as keys, but as long as device is in there we probably don't need any extra changes since results of all conditional expressions should be the same if evaluated more than once for the same device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense.

Comment on lines +1337 to +1342
The disadvantage to this design is that it increases compilation time when
there are multiple targets. Once the _Conditional Actions_ are resolved,
the LLVM IR is now specialized for one particular AOT target. If the user has
asked to compile for multiple targets, we need to split the IR prior to
resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for
each copy of the IR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is what we currently do anyway: each AOT target (like spir64_gen or spir64_x86_64) causes a separate invocation of FE and thus separate optimization pipeline invocation. so, this won't be worse than what we have today.

But at the same time, we currently have only a few AOT targets because they are generic and do not represent a particular device. Once we switch to specialized targets like spir64_intel_skl running FE separately for each of them would be a disaster and we better not lock-in ourselves into such situation.

What we could do is a single FE invocation and then several opt invocations for each target to at least somehow mitigate this.

Alternatively, we could just leave a single IR for all targets, but be more tricky about the IR we emit: instead of dropping bodies of call_if_on_device_conditionallyXXX we could replace their bodies to get the following:

define void @call_if_on_device_conditionallyXXX(%callablethis, %n1, %n2, ...) {
  %bool = __sycl_builtin_if_on_device.mangling(%n1, %n2, ...)
  br %bool, label %do_call, label %exit
do_call:
  call void @CallableXXX(%callablethis)
exit:
  ret void
}

This would be an alternative transformation which the new pass does for AOT targets. Yes, it won't allow to full optimizations and may still limit some inlining, but it is better than nothing and allows us to have unified IR for all AOT targets in the optimization pipeline. Later, in sycl-post-link we would resolve that __sycl_builtin_if_on_device.* into concrete boolean value based on the target device and run some DCE to cleanup dead branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we switch to specialized targets like spir64_intel_skl running FE separately for each of them would be a disaster and we better not lock-in ourselves into such situation.

Yes, this is what I was worried about when describing the disadvantages of this design.

What we could do is a single FE invocation and then several opt invocations for each target to at least somehow mitigate this.

Yes, this is what I meant by "split the IR prior to resolving the Conditional Actions and run the LLVM IR pipeline separately for each copy of the IR".

Alternatively, we could just leave a single IR for all targets, but be more tricky about the IR we emit: instead of dropping bodies of call_if_on_device_conditionallyXXX we could replace their bodies to get the following

I'm not sure this is a good idea. I'm worried that the body of @CallableXXX will get inlined and then some instructions from its body will be lifted above the br %bool. I saw some cases like this when experimenting with a solution like this. The nice thing about the primary design is that there is no call to @CallableXXX in the IR module, so it cannot possibly be inlined, thus avoiding the possibility of "leaking" any of its instructions above the condition check.

My current thinking is that:

  • We should use this alternate design for Nvidia targets because these targets run the FE separately for each target anyway.
  • We should use the primary design for Intel GPU targets and see if ocloc does a good job of inlining the calls to @CallableXXX and optimizing the result.

If ocloc optimizations are not good, we could try the alternate design, where we split the IR and run optimization passes separately for each Intel GPU target.

[7]: <./OptionalDeviceFeatures.md#changes-to-the-dpc-runtime>


## Alternate design for non-AOT SPIR-V targets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'm in favor of the main proposed design for non-AOT SPIR-V targets, because similar idea is likely to be used by virtual functions extension (#10540) and therefore we will have more code which is handled in a similar manner and less things to support.

Raising a bar for underlying backends is not a huge concern in here, because we are talking about an extension and not a core functionality. However, if we think that this would be promoted to core SYCL at some point, then it is maybe better to avoid requiring a SPIR-V extension for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree (somewhat). I think we should be using the same implementation for specialization constants and for this (and possibly for virtual functions). The features are very similar in what they do/can provide. It makes no sense to take different code paths. I also think that an ideal implementation would be mostly library-only on top of specialization constants if they are powerful enough.

The property set that contains the condition expression must contain
signed values because the condition operators are negative numbers.
@gmlueck gmlueck requested a review from a team as a code owner November 29, 2023 22:24
Copy link
Contributor Author

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and comments. Some responses below. I'll answer the remaining comments tomorrow.

being a function pointer. This callee is the _Conditional Action_ function
_FAction_.

* For each call to _FCaller_, the pass adds a new parameter at the beginning of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "literal function name" isn't the right term. I mean that the call site should be changed to look like this:

call void @call_if_on_device_conditionallyXXX(@CallableXXX, ...)

where @CallableXXX is a function defined in the same LLVM IR module.

In order for this to be correct IR, the declaration of @call_if_on_device_conditionallyXXX needs to be changed so that its first argument is pointer-to-function. When I originally write this PR, I tried these IR transformation by hand, so I'm pretty sure about the parameter type.

Comment on lines +1337 to +1342
The disadvantage to this design is that it increases compilation time when
there are multiple targets. Once the _Conditional Actions_ are resolved,
the LLVM IR is now specialized for one particular AOT target. If the user has
asked to compile for multiple targets, we need to split the IR prior to
resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for
each copy of the IR.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we switch to specialized targets like spir64_intel_skl running FE separately for each of them would be a disaster and we better not lock-in ourselves into such situation.

Yes, this is what I was worried about when describing the disadvantages of this design.

What we could do is a single FE invocation and then several opt invocations for each target to at least somehow mitigate this.

Yes, this is what I meant by "split the IR prior to resolving the Conditional Actions and run the LLVM IR pipeline separately for each copy of the IR".

Alternatively, we could just leave a single IR for all targets, but be more tricky about the IR we emit: instead of dropping bodies of call_if_on_device_conditionallyXXX we could replace their bodies to get the following

I'm not sure this is a good idea. I'm worried that the body of @CallableXXX will get inlined and then some instructions from its body will be lifted above the br %bool. I saw some cases like this when experimenting with a solution like this. The nice thing about the primary design is that there is no call to @CallableXXX in the IR module, so it cannot possibly be inlined, thus avoiding the possibility of "leaking" any of its instructions above the condition check.

My current thinking is that:

  • We should use this alternate design for Nvidia targets because these targets run the FE separately for each target anyway.
  • We should use the primary design for Intel GPU targets and see if ocloc does a good job of inlining the calls to @CallableXXX and optimizing the result.

If ocloc optimizations are not good, we could try the alternate design, where we split the IR and run optimization passes separately for each Intel GPU target.

properties to the iteration set, causing their _Add On Images_ to be found
also.

Once this completes, the runtime computes the union of the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense.

contains one property for each of its _Add On Images_. The name of each
property is a unique identifier for the _Add On Image_, which by convention is
just an integer in string form (e.g "1"). The value of the property has type
`PI_PROPERTY_TYPE_BYTE_ARRAY` containing a series of `uint32` values `N1`,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Changed in 01fa4f8.

Comment on lines +959 to +961
* Each pair of _Add On Images_ (i.e. the one with the real function
definitions and the one with the stub definitions) is assigned a unique
name. By convention this is just an integer in string form (e.g. "1").
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that the "scope" of the Add On Image numbers was limited to the ELF file that contains them. For example, if a Main Image contains a reference to Add On Image "1", we only search for that Add On Image in the same ELF file.

If you think this will be difficult and we need a globally unique name, we could use a GUID. However, I think the driver will need to pass a GUID via a command line option to sycl-post-link.

gmlueck added a commit to gmlueck/llvm that referenced this pull request Dec 27, 2023
Revise the design for handling optional kernel features for AOT
compilations.  There is no need to create a separate tool
"aspect-filter" because the filtering can be done in the
"sycl-post-link" tool instead.  This is better aligned with our
long term direction because "sycl-post-link" will need to do some
AOT-specific IR transformations in the future as part of the design
presented in intel#9127, and those transformation will also make use of
the device configuration file.

This commit also addresses some issues that were missed before related
to embedding several AOT-compiled offload bundles into the host
executable.
Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a full understanding of all the things in play here, but gut tells me that we should be making specialization constants more powerful and building this on top of them instead.

It might result in essentially the same implementation underneath but using slightly different terms, but the resulting design (IMO) would be cleaner and leaner.

// Condition is a parameter pack of int's that define a simple expression
// language which tells the set of aspects or architectures that the device
// must have in order to enable the call. See the "Condition*" values below.
template<typename T, typename ...Condition>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be int... Conditions instead?

Comment on lines +373 to +376
call void @call_if_on_device_conditionallyXXX(@CallableXXX, %callablethis, N1, N2, ...)

declare void @call_if_on_device_conditionallyXXX(%callable, %callablethis, %n1, %n2, ...)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a full understanding of this yet, but another approach would be to have some clang builtin that gets lowered to either this directly or to something similar using operand bundles instead (with the same purpose of "defeating" the optimizer).

This IR pass is changed to perform the following additional aspect
propagations:

* Aspects used by each _Conditional Action_ function (and by functions it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think kernel-bundle based approach would allow us to keep top-down approach with some extra handling/resets in handling of these nodes.

Comment on lines +830 to +833
To illustrate, consider an example where the _condition_ is "fp16 == true" and
the _aspects_ is "fp16". In such a case, the conditional aspect usage is
uninteresting because any device where "fp16 == true" will definitely support
the "fp16" aspect.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not confident, but this suggests to me that there is a more generic abstraction that could be elegantly used for both purposes somewhat uniformly.

[7]: <./OptionalDeviceFeatures.md#changes-to-the-dpc-runtime>


## Alternate design for non-AOT SPIR-V targets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree (somewhat). I think we should be using the same implementation for specialization constants and for this (and possibly for virtual functions). The features are very similar in what they do/can provide. It makes no sense to take different code paths. I also think that an ideal implementation would be mostly library-only on top of specialization constants if they are powerful enough.

// language which tells the set of aspects or architectures that the device
// must have in order to enable the call. See the "Condition*" values below.
template<typename T, typename ...Condition>
[[__sycl_detail__::add_ir_attributes_function("sycl-call-if-on-device-conditionally", true)]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[[__sycl_detail__::add_ir_attributes_function("sycl-call-if-on-device-conditionally", true)]]
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::add_ir_attributes_function(
"sycl-call-if-on-device-conditionally", true)]]
#endif

Copy link
Contributor

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Oct 17, 2024
@gmlueck gmlueck removed the Stale label Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants