[SYCL][DOC] Add phase 2 design for "if_device_has" #9127
base: sycl
Conversation
Add the "phase 2" design for `if_device_has` and `if_architecture_is` extensions. Whereas the phase 1 design worked only in some limited cases for AOT compilation, the phase 2 design fully supports both of these extensions in both AOT and JIT modes. The phase 2 design is also implementable in a 1-pass compiler. Note that the design document presents a primary design and also two alternatives. Discussion of the alternatives is encouraged in this PR. Attention @intel/dpcpp-spirv-doc-reviewers: The SPIR-V extension specification in this PR is used only in one of the alternate designs. Although comments on the SPIR-V spec are welcome, it may make sense to decide first whether the alternate design makes sense before doing a detailed review of the SPIR-V spec.
The design looks good to me, left a few comments for things to potentially clarify.
> being a function pointer. This callee is the _Conditional Action_ function
> _FAction_.
>
> * For each call to _FCaller_, the pass adds a new parameter at the beginning of
Should this say _FAction_ here?
No. The pass adds a new parameter to each call to FCaller. That new parameter is the literal function name FAction.
The calls to FAction are in the body of FCaller, and this body is removed (see the next bullet point).
I'm still confused by this bullet. If we pass an extra argument to `FCaller` which is a "literal function name", then why are we updating the `FCaller` definition to have a pointer-to-function argument? Shouldn't arguments match between the call site and the definition? Am I misunderstanding what "literal function name" means?
Maybe "literal function name" isn't the right term. I mean that the call site should be changed to look like this:

```llvm
call void @call_if_on_device_conditionallyXXX(@CallableXXX, ...)
```

where `@CallableXXX` is a function defined in the same LLVM IR module. In order for this to be correct IR, the declaration of `@call_if_on_device_conditionallyXXX` needs to be changed so that its first argument is pointer-to-function. When I originally wrote this PR, I tried these IR transformations by hand, so I'm pretty sure about the parameter type.
> * If the target is SPIR-V:
>
>   * It moves the definition of `@CallableXXX` and its entire call tree to a
What would happen if the call tree of the callable includes functions that are also called from outside the callable's call tree? Would those functions need to be duplicated?
Yes. This is described in the detailed section below titled "Changes to the sycl-post-link tool (non-AOT)".
> * The call to _FCallerB_ has _Conditional Action_ named _FActionB_ and
>   _Condition Expression_ named _ExpB_.
>
> When this happens, aspects used by _FActionA_ have the condition _ExpA_.
When the conditional expressions are combined like this, would it retain separate relationships between the conditions and the aspects?
What do you mean by "separate relationship"? The next section describes the information in the `!sycl_conditionally_used_aspects` metadata. Each set of aspects has an associated condition. Does that answer your question?
> functions as described in [Device Code Dynamic Linking][6]. This algorithm is
> extended to look also for _Add On Images_.
>
> If the _Main Image_ contains the "SYCL/add on images" property set, the runtime
How would this handle nested add-on images? Would these all be contained in the same property set? Does the linking order here matter?
Nested add-on images are handled in the last bullet point below that starts "The selected Add On Image may also contain a "SYCL/add on images" property set". The algorithm in `sycl-post-link` breaks nested `if_device_has` calls into separate add-on images. Each of those add-on images contains its own "SYCL/device requirements" property set that defines the aspects it uses. The link order doesn't matter.
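To make the order-independence claim concrete, here is a minimal sketch of that runtime iteration. This is a toy model, not the actual SYCL runtime code: the `AddOnMap` type and `collectAddOnImages` name are my own invention, standing in for the "SYCL/add on images" property-set lookup.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model: each image name maps to the add-on image names listed in its
// "SYCL/add on images" property set (absent entry = no such property set).
using AddOnMap = std::map<std::string, std::vector<std::string>>;

// Collect every add-on image transitively reachable from the main image.
// Each image is inserted into the result (and scanned) exactly once, so
// the final set does not depend on the order references are discovered in.
std::set<std::string> collectAddOnImages(const AddOnMap &refs,
                                         const std::string &mainImage) {
  std::set<std::string> found;
  std::vector<std::string> worklist{mainImage};
  while (!worklist.empty()) {
    std::string img = worklist.back();
    worklist.pop_back();
    auto it = refs.find(img);
    if (it == refs.end())
      continue; // this image has no "SYCL/add on images" property set
    for (const std::string &addOn : it->second)
      if (found.insert(addOn).second) // nested add-on images get re-scanned
        worklist.push_back(addOn);
  }
  return found;
}
```

Because the worklist visit is a plain reachability computation, selecting add-on images in any order yields the same final set, matching "the link order doesn't matter".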
> resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for
> each copy of the IR.
>
> It would be possible to use this alternate design for some AOT targets but not
I think this design would be preferable for Nvidia and AMD targets, as there are many optimizations, such as inlining, that are crucial for performance and that, as I understand it, would not be run across the conditional-caller / conditional-action boundary. I think the benefit of the additional optimizations would be worth the higher compilation times when targeting multiple AOT targets.
Yes, this seems reasonable to me. This is what I had in mind as a possibility when writing this section.
> This IR pass is changed to perform the following additional aspect
> propagations:
>
> * Aspects used by each _Conditional Action_ function (and by functions it
Do I understand correctly that we don't have any special markup on _Conditional Actions_? I'm asking because currently we don't really do bottom-up propagation; instead we do a top-down call graph analysis starting from kernels. Therefore, with the current algorithm we will mostly ignore _Conditional Actions_. We can rewrite it to be bottom-up, but the question is where do we add the metadata: to each node in the call graph? Only to top-level entities such as kernels? Or to any functions which don't have callers within a module?

This is all solvable; I'm noting it here because it would be good to avoid placing that metadata everywhere, to prevent unnecessary bloat of the LLVM IR modules we produce.
I think a kernel-bundle based approach would allow us to keep the top-down approach, with some extra handling/resets for these nodes.
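For contrast with the top-down analysis discussed above, here is a minimal bottom-up propagation sketch. This is a toy model with invented names (`FunctionNode`, `propagateAspects`), not the actual IR pass; it assumes an acyclic call graph and skips the memoization and cycle handling a real pass would need. A _Conditional Action_ with no callers would simply become another propagation root.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Aspects = std::set<std::string>;

// Toy call-graph node: aspects used directly in the body, plus ordinary
// (non-conditional) callees.
struct FunctionNode {
  Aspects directAspects;
  std::vector<std::string> callees;
};

// Bottom-up propagation: a function's used aspects are its own plus the
// union of those used by everything it calls (acyclic graph assumed).
Aspects propagateAspects(const std::map<std::string, FunctionNode> &cg,
                         const std::string &name) {
  const FunctionNode &f = cg.at(name);
  Aspects result = f.directAspects;
  for (const std::string &callee : f.callees) {
    Aspects sub = propagateAspects(cg, callee);
    result.insert(sub.begin(), sub.end());
  }
  return result;
}
```

In this model, running `propagateAspects` from each kernel and from each _Conditional Action_ function covers both entry kinds without attaching metadata to every node in the graph.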
> * When determining whether two kernels can be placed in the same device image,
>   the `!used_aspects` must be the same and the
>   `!sycl_conditionally_used_aspects` must be the same (the same set of
>   conditions and the same set of conditionally used aspects).
Why do we need to take `!sycl_conditionally_used_aspects` into account? Since actual uses of those conditionally used aspects will optionally be linked later at runtime, it should be safe to bundle together two kernels which directly use the same aspects.
> * Aspect usage can be propagated through nested _Conditional Caller_ function
>   calls. To illustrate, consider the following example:
>
>   * A _Conditional Caller_ named _FCallerA_ has the _Conditional Action_ named
>     _FActionA_ and the _Condition Expression_ named _ExpA_.
>   * The function _FActionA_ calls a different _Conditional Caller_ named
>     _FCallerB_.
>   * The call to _FCallerB_ has _Conditional Action_ named _FActionB_ and
>     _Condition Expression_ named _ExpB_.
If I understand correctly, both `FActionA` and `FActionB` will in the end each reside in their own device image with individual requirements. Do we even need, then, to propagate conditionally used aspects through nested conditional calls?
> * Each pair of _Add On Images_ (i.e. the one with the real function
>   definitions and the one with the stub definitions) is assigned a unique
>   name. By convention this is just an integer in string form (e.g. "1").
A simple integer may not be enough here, because the application may be linked to a shared library which also contains SYCL kernels and `if_device_has` calls; thus both the app and the library would provide add-on device images with the same integer string "1". We probably need to include a GUID in the name as well.
I was thinking that the "scope" of the _Add On Image_ numbers was limited to the ELF file that contains them. For example, if a _Main Image_ contains a reference to _Add On Image_ "1", we only search for that _Add On Image_ in the same ELF file.

If you think this will be difficult and we need a globally unique name, we could use a GUID. However, I think the driver will need to pass a GUID via a command line option to `sycl-post-link`.
sycl/doc/design/DeviceIf.md
Outdated
> contains one property for each of its _Add On Images_. The name of each
> property is a unique identifier for the _Add On Image_, which by convention is
> just an integer in string form (e.g. "1"). The value of the property has type
> `PI_PROPERTY_TYPE_BYTE_ARRAY` containing a series of `uint32` values `N1`,
Should it be a signed integer? Encoding for logical operators and "keywords" is defined as negative numbers
Good catch. Changed in 01fa4f8.
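For illustration, here is a sketch of how signed condition values could be serialized into a byte-array property. The operator codes below are invented placeholders (the actual encoding is defined elsewhere in the design doc); the point is that casting through `uint32_t` preserves the two's-complement bit pattern of the negative operator codes.

```cpp
#include <cstdint>
#include <vector>

// Invented placeholder codes: negative values stand for operators/keywords,
// non-negative values name aspects or architectures.
constexpr int32_t kOpNot = -1;
constexpr int32_t kOpAnd = -2;

// Serialize signed 32-bit condition values into a little-endian byte array,
// as a PI_PROPERTY_TYPE_BYTE_ARRAY value would carry them.
std::vector<uint8_t> packCondition(const std::vector<int32_t> &values) {
  std::vector<uint8_t> bytes;
  for (int32_t v : values) {
    uint32_t u = static_cast<uint32_t>(v); // keep two's-complement bits
    for (int shift = 0; shift < 32; shift += 8)
      bytes.push_back(static_cast<uint8_t>(u >> shift));
  }
  return bytes;
}
```

A reader of the property must therefore reinterpret each 32-bit word as signed before checking for operator codes; treating the words as plain `uint32` would turn `-1` into `0xFFFFFFFF`.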
> properties to the iteration set, causing their _Add On Images_ to be found
> also.
>
> Once this completes, the runtime computes the union of the
We should also think about the in-memory and on-disk caches that we have. I don't remember what we have there as keys, but as long as the device is in there, we probably don't need any extra changes, since the results of all conditional expressions should be the same if evaluated more than once for the same device.
Yes, that makes sense.
> The disadvantage to this design is that it increases compilation time when
> there are multiple targets. Once the _Conditional Actions_ are resolved,
> the LLVM IR is now specialized for one particular AOT target. If the user has
> asked to compile for multiple targets, we need to split the IR prior to
> resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for
> each copy of the IR.
I think that this is what we currently do anyway: each AOT target (like `spir64_gen` or `spir64_x86_64`) causes a separate invocation of the FE and thus a separate optimization pipeline invocation, so this won't be worse than what we have today.

But at the same time, we currently have only a few AOT targets because they are generic and do not represent a particular device. Once we switch to specialized targets like `spir64_intel_skl`, running the FE separately for each of them would be a disaster, and we had better not lock ourselves into such a situation.

What we could do is a single FE invocation and then several `opt` invocations, one per target, to at least somewhat mitigate this.

Alternatively, we could just leave a single IR for all targets but be more tricky about the IR we emit: instead of dropping the bodies of `call_if_on_device_conditionallyXXX`, we could replace their bodies to get the following:

```llvm
define void @call_if_on_device_conditionallyXXX(%callablethis, %n1, %n2, ...) {
  %bool = __sycl_builtin_if_on_device.mangling(%n1, %n2, ...)
  br %bool, label %do_call, label %exit
do_call:
  call void @CallableXXX(%callablethis)
exit:
  ret void
}
```

This would be an alternative transformation which the new pass does for AOT targets. Yes, it won't allow full optimizations and may still limit some inlining, but it is better than nothing and allows us to have unified IR for all AOT targets in the optimization pipeline. Later, in `sycl-post-link`, we would resolve that `__sycl_builtin_if_on_device.*` into a concrete boolean value based on the target device and run some DCE to clean up the dead branches.
> Once we switch to specialized targets like `spir64_intel_skl`, running the FE separately for each of them would be a disaster, and we had better not lock ourselves into such a situation.

Yes, this is what I was worried about when describing the disadvantages of this design.

> What we could do is a single FE invocation and then several `opt` invocations, one per target, to at least somewhat mitigate this.

Yes, this is what I meant by "split the IR prior to resolving the _Conditional Actions_ and run the LLVM IR pipeline separately for each copy of the IR".

> Alternatively, we could just leave a single IR for all targets but be more tricky about the IR we emit: instead of dropping the bodies of `call_if_on_device_conditionallyXXX`, we could replace their bodies to get the following

I'm not sure this is a good idea. I'm worried that the body of `@CallableXXX` will get inlined and then some instructions from its body will be lifted above the `br %bool`. I saw some cases like this when experimenting with a solution like this. The nice thing about the primary design is that there is no call to `@CallableXXX` in the IR module, so it cannot possibly be inlined, thus avoiding the possibility of "leaking" any of its instructions above the condition check.

My current thinking is that:

- We should use this alternate design for Nvidia targets because these targets run the FE separately for each target anyway.
- We should use the primary design for Intel GPU targets and see if `ocloc` does a good job of inlining the calls to `@CallableXXX` and optimizing the result.

If the `ocloc` optimizations are not good, we could try the alternate design, where we split the IR and run optimization passes separately for each Intel GPU target.
> [7]: <./OptionalDeviceFeatures.md#changes-to-the-dpc-runtime>
>
> ## Alternate design for non-AOT SPIR-V targets
Personally, I'm in favor of the main proposed design for non-AOT SPIR-V targets, because a similar idea is likely to be used by the virtual functions extension (#10540), and therefore we will have more code which is handled in a similar manner and fewer things to support.

Raising the bar for underlying backends is not a huge concern here, because we are talking about an extension and not core functionality. However, if we think that this will be promoted to core SYCL at some point, then it is maybe better to avoid requiring a SPIR-V extension for this.
I disagree (somewhat). I think we should be using the same implementation for specialization constants and for this (and possibly for virtual functions). The features are very similar in what they do/can provide. It makes no sense to take different code paths. I also think that an ideal implementation would be mostly library-only on top of specialization constants if they are powerful enough.
The property set that contains the condition expression must contain signed values because the condition operators are negative numbers.
Thanks for the review and comments. Some responses below. I'll answer the remaining comments tomorrow.
Revise the design for handling optional kernel features for AOT compilations. There is no need to create a separate "aspect-filter" tool because the filtering can be done in the "sycl-post-link" tool instead. This is better aligned with our long-term direction because "sycl-post-link" will need to do some AOT-specific IR transformations in the future as part of the design presented in intel#9127, and those transformations will also make use of the device configuration file. This commit also addresses some issues that were missed before related to embedding several AOT-compiled offload bundles into the host executable.
I don't have a full understanding of all the things in play here, but my gut tells me that we should be making specialization constants more powerful and building this on top of them instead. It might result in essentially the same implementation underneath, just using slightly different terms, but the resulting design (IMO) would be cleaner and leaner.
```cpp
// Condition is a parameter pack of int's that define a simple expression
// language which tells the set of aspects or architectures that the device
// must have in order to enable the call. See the "Condition*" values below.
template<typename T, typename ...Condition>
```
Should it be `int... Conditions` instead?
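A sketch of the reviewer's suggestion, showing how a non-type pack `int... Conditions` carries the condition values directly as template arguments. All names here are illustrative, not the actual extension header; a real implementation would hand the values to the IR-level call rather than materialize them.

```cpp
#include <array>

// Illustrative only: with `int... Conditions` the encoded condition values
// are template arguments and can be materialized as a constexpr array.
template <int... Conditions>
constexpr std::array<int, sizeof...(Conditions)> encodeConditions() {
  return {Conditions...};
}

// Sketch of the conditional-caller shape using the non-type pack. Putting
// the pack first lets callers write call_if_on_device_conditionally<1, -2>(fn)
// while the callable type is deduced.
template <int... Conditions, typename Callable>
void call_if_on_device_conditionally(Callable fn) {
  constexpr auto encoded = encodeConditions<Conditions...>();
  (void)encoded; // in the real design, consumed by the compiler pass
  fn();          // sketch only: the pass decides whether this runs
}
```

With `typename... Condition`, each condition value must instead be wrapped in a type (e.g. `std::integral_constant<int, N>`), which is the extra indirection the reviewer's question is about.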
```llvm
call void @call_if_on_device_conditionallyXXX(@CallableXXX, %callablethis, N1, N2, ...)

declare void @call_if_on_device_conditionallyXXX(%callable, %callablethis, %n1, %n2, ...)
```
I don't have a full understanding of this yet, but another approach would be to have some clang builtin that gets lowered to either this directly or to something similar using operand bundles instead (with the same purpose of "defeating" the optimizer).
> To illustrate, consider an example where the _condition_ is "fp16 == true" and
> the _aspects_ is "fp16". In such a case, the conditional aspect usage is
> uninteresting because any device where "fp16 == true" will definitely support
> the "fp16" aspect.
I'm not confident, but this suggests to me that there is a more generic abstraction that could be elegantly used for both purposes somewhat uniformly.
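The pruning rule in the quoted example can be sketched as follows. This is a deliberate oversimplification: a condition is modeled only by the set of aspects it guarantees, ignoring the real expression language with its operators, and the function name is invented.

```cpp
#include <set>
#include <string>

// Toy model: represent a condition by the set of aspects a device is
// guaranteed to have whenever the condition evaluates to true
// (e.g. the condition "fp16 == true" guarantees the "fp16" aspect).
using GuaranteedAspects = std::set<std::string>;

// A conditionally used aspect is "uninteresting" when the condition
// already guarantees that any device passing the check supports it,
// so it can be dropped from the conditionally-used-aspects record.
bool isImpliedByCondition(const GuaranteedAspects &guaranteed,
                          const std::string &usedAspect) {
  return guaranteed.count(usedAspect) != 0;
}
```

A more generic abstraction, as the comment suggests, would replace the set-membership test with an entailment check over the full condition language.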
```cpp
// language which tells the set of aspects or architectures that the device
// must have in order to enable the call. See the "Condition*" values below.
template<typename T, typename ...Condition>
[[__sycl_detail__::add_ir_attributes_function("sycl-call-if-on-device-conditionally", true)]]
```
```suggestion
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::add_ir_attributes_function(
    "sycl-call-if-on-device-conditionally", true)]]
#endif
```
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days. |