Support native code binary representation for XPU backend #2148
Conversation
Force-pushed from 52f3ad9 to d444315.
Testing under Torch Inductor in AOT mode I got the following timings:

So, by caching the native code we save about 60 ms. But that's for a single kernel, and a typical Inductor model may have 100 or more kernels. I've made an additional cleanup pass and rebased on main, so I am marking this ready for review. The functionality is under a flag, so nothing will change once we merge it. We will need to decide how we want to handle our binary code generation going forward, though. Do we want to generate native code and not SPIRV by default, with the SPIRV generation under a flag? Or do we want to try to only generate native code when Inductor is in AOT mode? And, if we generate native code and not SPIRV, should we change the suffix of the binary file?
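The scale of the saving is simple arithmetic on the numbers quoted above (per-kernel saving and kernel count are the figures from this comment, not measurements of any particular model):

```python
# Back-of-envelope from the numbers above: ~60 ms saved per kernel,
# and a typical Inductor model may have 100 or more kernels.
per_kernel_saving_ms = 60
kernels = 100
total_saving_s = per_kernel_saving_ms * kernels / 1000
print(total_saving_s)  # 6.0 seconds of compile time saved per model
```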
Force-pushed from d444315 to 0918b70.
@etiotto @whitneywhtsang this is ready for review. Could one of you please take a look, or suggest a reviewer?
I left some inline comments to address.
```python
fbin = fsrc.name + '.o'

ocloc_cmd = [
    'ocloc', 'compile', '-file', fsrc.name, '-o', fbin, '-spirv_input', '-device', 'pvc', '-options',
```
The device is hardcoded to PVC. I do not think that would work for older devices. Can we choose the device based on the information PyTorch passes to the compiler?
Yes, we should. If you are OK with it, I will make this a follow-up task, though: this is not on by default, so I think this behavior is acceptable for now.
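The follow-up could look roughly like the sketch below: pick the `-device` argument from the target architecture instead of hardcoding `'pvc'`. The mapping table and the `ocloc_device_for` helper are illustrative placeholders, not existing Triton XPU API; only `ocloc`'s `-device` option itself is taken from the command quoted above.

```python
# Hypothetical mapping from a target-architecture string to an ocloc
# '-device' value. The keys/values here are examples, not a complete list.
ARCH_TO_OCLOC_DEVICE = {
    "pvc": "pvc",   # Data Center GPU Max (Ponte Vecchio)
    "dg2": "dg2",   # Arc / Alchemist
}

def ocloc_device_for(arch: str) -> str:
    # Fall back to 'pvc' (the current hardcoded behavior) when the
    # architecture is unknown.
    return ARCH_TO_OCLOC_DEVICE.get(arch, "pvc")

ocloc_cmd = ["ocloc", "compile", "-device", ocloc_device_for("dg2")]
```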
```cpp
gpuAssert(zeKernelGetProperties(l0_kernel, &props));
n_spills = props.spillMemSize;
std::cout << "(I): Kernel has now " << n_spills << " spills" << std::endl;
if (is_spv) {
```
[nit]: early exit if `is_spv` is false
I could put it in a lambda I suppose... but we still need the return code below the branch, right? I think I am missing something.
We could also outline the GRF handling to a new function.
Good idea - I was thinking about something similar. I will put that on my list alongside returning the GRF size as # of registers used.
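The shape of the suggested refactor (outline the handling into a helper with a guard clause) is sketched below in Python for brevity; the actual code under review is C++, and all names here are illustrative stand-ins:

```python
# Guard-clause sketch of the suggested refactor: pull the spill/GRF
# handling out into a helper that returns early when the input is not
# SPIRV, so the main path stays flat. Names are illustrative only.
def handle_spill_recompile(is_spv: bool, n_spills: int, threshold: int = 0) -> int:
    if not is_spv:
        # Early exit: a native binary cannot be rebuilt with different
        # GRF options, so just report the spill count as-is.
        return n_spills
    if n_spills > threshold:
        # ... rebuild with large GRF mode and re-query the spill count ...
        pass
    return n_spills
```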
""" | ||
The exact message is something like: | ||
warning: kernel matmul_kernel compiled SIMD16 allocated 128 regs and spilled around 217 | ||
is "spilled" enough for now? | ||
""" |
Why stringdoc? Can we `#` comment instead?
It's a multi-line comment. What's the difference?
It is not a multi-line comment, in fact; it is a string literal (see). The only comment syntax in Python is `#`. Really a nit, but worth noting.
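The distinction is easy to demonstrate: a triple-quoted "comment" survives parsing as a string-constant expression statement, while a real `#` comment is discarded by the parser entirely:

```python
import ast

# Parse a snippet containing a bare string literal, a # comment, and an
# assignment. The string literal shows up in the AST; the comment does not.
tree = ast.parse('"""not a comment"""\n# a real comment\nx = 1')

print(type(tree.body[0]).__name__)  # Expr   - the string is a statement
print(type(tree.body[1]).__name__)  # Assign - the # comment left no node
print(len(tree.body))               # 2
```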
Commits:
- Cache native code 2/?
- Cache native code 3/?
- Cache native code 4/?
- Cache native code 5/?
- Cache native code 6/? initialize a private sycl context for compilation
- Cache native code 7/? port the register spills code
- Cache native code 8/? less verbose timing logs
- Cache native code 9/? cleanups, fix flag, do some measuring
- Cache native code 10/? Use ocloc
Force-pushed from 5392cc4 to d1dc02e.
…ode option (#2391) Intel Data Center Max GPUs will dynamically scale the number of hardware threads available per XVE depending on the specified GRF mode. With small GRF mode (the default), a single hardware thread can access 128 GRF registers and each XVE engine has 8 hardware threads. In large GRF mode, a single hardware thread can access 256 GRF registers, but each XVE engine has only 4 hardware threads. There is also an auto mode. ([see the docs for more info](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/small-register-mode-vs-large-register-mode.html))

This PR adds support for populating the `n_regs` parameter returned from loading a binary with information about the selected GRF mode. Because L0 does not return the number of registers and our register size info does not work like NVIDIA's, the semantics are a bit different from upstream Triton. We _only_ return a value if the user has specified a small or large GRF mode build flag. The purpose of returning `n_regs` in upstream Triton/Torch Inductor is that NVIDIA can dynamically adjust occupancy of an SM based on the register pressure per warp. This means high register pressure can result in fewer running warps, which reduces parallelism and performance. Theoretically, you can have many different "GRF modes" on an NVIDIA GPU as you adjust SM occupancy. For Intel GPUs, the choice is binary (large or small) and the performance penalty for register spills in small mode always outweighs any parallelism gains (at least in our testing so far). It is not clear that returning 128 is actionable, as further reductions in register usage will not affect occupancy; only the large GRF mode affects occupancy. So, I focused on making sure large GRF mode was properly handled and other cases were handled as we were able, with any ambiguous case returning 0 (which will cause Torch Inductor to skip any register-specific optimization).

The approach to returning GRF size is dependent on parsing the build flags passed to the binary loader. Because the build flags are modified in the `make_spv` step during generation of native code instead of a SPIRV file, this approach should work for the native code POC recently merged in #2148. Note that I had to introduce exceptions into our `driver.c` code to make the error handling acceptable. This cleaned up a lot of the code, and I believe it should be acceptable both because we already depend on C++ in `driver.c` (just not in the external signatures) and because exceptions are used in other parts of the Triton codebase. I marked this as a draft PR because I would like to do a bit more testing, but it is ready for review. Close #1641
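The flag-parsing convention described above can be sketched as follows. The large-GRF build flag spelling is the Level Zero option commonly used for this purpose, but treat both flag strings and the helper name as illustrative assumptions rather than the PR's actual code:

```python
# Sketch of the n_regs convention: report a register count only when a
# GRF mode was explicitly requested via build flags; otherwise return 0
# so Torch Inductor skips register-specific optimization.
LARGE_GRF_FLAG = "-ze-opt-large-register-file"    # assumed L0 build option
SMALL_GRF_FLAG = "-ze-intel-128-GRF-per-thread"   # hypothetical spelling

def n_regs_from_build_flags(flags: str) -> int:
    if LARGE_GRF_FLAG in flags:
        return 256   # large GRF mode: 256 registers per thread
    if SMALL_GRF_FLAG in flags:
        return 128   # small GRF mode: 128 registers per thread
    return 0         # ambiguous (default/auto): report nothing
```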
Adds a new command line flag, `TRITON_XPU_GEN_NATIVE_CODE`, which is used to enable generating native device code and storing it in the `.spv` file instead of SPIRV. To avoid having to access the SYCL runtime inside the compiler, we use `ocloc` (just like the NVIDIA backend uses `ptxas` to generate `cubin` from `ptx`). But, because there is no textual representation of `spirv`, we do not store the SPIRV. Originally, I had changed the file extension but decided to stick with `spv` for now while we evaluate if/when we want to enable this functionality.

In my testing this makes very little difference in back-to-back runs, because the driver caches the native code. But this feature was requested for Inductor AOT mode, where the model is exported into a self-contained library.
Close #1792
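The flag-gated flow described in the PR can be sketched roughly as below. The helper name and file handling are illustrative; only the environment variable and the `ocloc` invocation mirror what is quoted earlier in this thread, and the device is still hardcoded here as in the POC:

```python
import os
import subprocess

def finalize_binary(spirv_path: str) -> str:
    """Sketch: when TRITON_XPU_GEN_NATIVE_CODE is set, replace the SPIRV
    bytes with ocloc-compiled native code, keeping the .spv suffix."""
    if os.environ.get("TRITON_XPU_GEN_NATIVE_CODE") != "1":
        return spirv_path  # default path: ship SPIRV as before
    out = spirv_path + ".o"
    subprocess.check_call([
        "ocloc", "compile", "-file", spirv_path, "-o", out,
        "-spirv_input", "-device", "pvc",
    ])
    os.replace(out, spirv_path)  # native code now lives under the .spv name
    return spirv_path
```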