Support native code binary representation for XPU backend #2148

Merged: 7 commits into main, Sep 25, 2024

Conversation

@alexbaden (Contributor) commented Sep 6, 2024

Adds a new command-line flag, TRITON_XPU_GEN_NATIVE_CODE, which enables generating native device code and storing it in the .spv file instead of SPIR-V. To avoid having to access the SYCL runtime inside the compiler, we use ocloc (just as the NVIDIA backend uses ptxas to generate cubin from PTX). But because there is no textual representation of SPIR-V in our pipeline, we do not store the SPIR-V alongside the native code. Originally I had changed the file extension, but decided to stick with .spv for now while we evaluate if/when we want to enable this functionality.
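For illustration, a minimal sketch of what the flag-gated ocloc step could look like. The helper name and temp-file handling are hypothetical; the ocloc arguments match the snippet reviewed below (the real code also passes -options with build flags, omitted here):

```python
import os
import subprocess
import tempfile

def spirv_to_native(spirv: bytes, device: str = "pvc") -> bytes:
    """Hypothetical sketch: compile a SPIR-V module to native code via ocloc."""
    if os.environ.get("TRITON_XPU_GEN_NATIVE_CODE", "0") != "1":
        return spirv  # default path: keep storing SPIR-V in the .spv file
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "kernel.spv")
        with open(src, "wb") as f:
            f.write(spirv)
        fbin = src + ".o"
        # ocloc consumes the SPIR-V input and emits a native binary
        # for the requested device.
        subprocess.run(
            ["ocloc", "compile", "-file", src, "-o", fbin,
             "-spirv_input", "-device", device],
            check=True, capture_output=True,
        )
        with open(fbin, "rb") as f:
            return f.read()  # native code, still written out under .spv
```

Enabling the feature is then just a matter of setting TRITON_XPU_GEN_NATIVE_CODE=1 in the environment.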

In my testing this makes very little difference in back-to-back runs because the driver caches the native code, but the feature was requested for Inductor AOT mode, where the model is exported into a self-contained library.

SPIR-V:
compile eval time 0: 10.648161754
compile eval time 1: 0.013945419
compile eval time 2: 0.012984403
compile eval time 3: 0.012636915
compile eval time 4: 0.0126077
compile eval time 5: 0.012621725
compile eval time 6: 0.012668987
compile eval time 7: 0.012654901
compile eval time 8: 0.01264564
compile eval time 9: 0.013606956

Generated native code:
compile eval time 0: 15.920665989
compile eval time 1: 0.013856625
compile eval time 2: 0.012849391
compile eval time 3: 0.01316768
compile eval time 4: 0.012814131
compile eval time 5: 0.012739703
compile eval time 6: 0.012831038
compile eval time 7: 0.012756367
compile eval time 8: 0.012842068
compile eval time 9: 0.013376863

Close #1792

@pbchekin changed the base branch from llvm-target to main on September 14, 2024
@alexbaden force-pushed the alex/cache_native_code branch from 52f3ad9 to d444315 on September 19, 2024
@alexbaden marked this pull request as ready for review on September 19, 2024

@alexbaden (Author) commented:

Testing under Torch Inductor in AOT mode I got the following timings for loadKernel:

SPIR-V
loadKernel Timing: 54 ms
loadKernel Timing: 62 ms
loadKernel Timing: 58 ms
loadKernel Timing: 62 ms
loadKernel Timing: 60 ms

Native Code
loadKernel Timing: 0 ms
loadKernel Timing: 2 ms
loadKernel Timing: 0 ms
loadKernel Timing: 2 ms
loadKernel Timing: 2 ms

So, by caching the native code we save about 60 ms - but that is for a single kernel, and a typical Inductor model may have 100 or more kernels.

I've made an additional cleanup pass and rebased on main, so I am marking this ready for review. The functionality is behind a flag, so nothing will change once we merge it - but we will need to decide how to handle our binary code generation going forward. Do we want to generate native code rather than SPIR-V by default, with SPIR-V generation under a flag? Or do we want to generate native code only when Inductor is in AOT mode? And if we generate native code instead of SPIR-V, should we change the suffix of the binary file from .spv to something else (.zebin or .xebin)?

@alexbaden force-pushed the alex/cache_native_code branch from d444315 to 0918b70 on September 20, 2024

@alexbaden (Author) commented:

@etiotto @whitneywhtsang this is ready for review. Could one of you please take a look, or suggest a reviewer?

@etiotto etiotto requested a review from a team September 23, 2024 17:43
@etiotto (Contributor) left a comment:

I left some inline comments to address.

In third_party/intel/backend/compiler.py:
fbin = fsrc.name + '.o'

ocloc_cmd = [
'ocloc', 'compile', '-file', fsrc.name, '-o', fbin, '-spirv_input', '-device', 'pvc', '-options',

@etiotto (Contributor):
The device is hardcoded to PVC. I do not think that would work for older devices. Can we choose the device based on the information PyTorch passes to the compiler?

@alexbaden (Author):

Yes, we should. If you are OK with it, I will make this a follow-up task - this is not on by default, so I think the current behavior is acceptable for now.
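One possible shape for that follow-up (hypothetical: the runtime lookup and the name-to-ocloc mapping below are illustrative, not from this PR):

```python
def ocloc_device() -> str:
    """Pick the ocloc -device argument from the runtime device (sketch)."""
    try:
        import torch
        name = torch.xpu.get_device_name().lower()
    except Exception:
        return "pvc"  # fall back to the current hardcoded default
    if "data center gpu max" in name:
        return "pvc"  # Ponte Vecchio
    if "arc" in name:
        return "dg2"  # e.g. Arc A-series; mapping is an assumption
    return "pvc"
```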

gpuAssert(zeKernelGetProperties(l0_kernel, &props));
n_spills = props.spillMemSize;
std::cout << "(I): Kernel has now " << n_spills << " spills" << std::endl;
if (is_spv) {

@etiotto (Contributor):
[nit]: early exit if is_spv is false

@alexbaden (Author):
I could put it in a lambda I suppose... but we still need the return code below the branch, right? I think I am missing something.

@etiotto (Contributor):
We could also outline the GRF handling to a new function.

@alexbaden (Author):
Good idea - I was thinking about something similar. I will put that on my list, alongside returning the GRF size as the number of registers used.

Comment on lines +321 to +325
"""
The exact message is something like:
warning: kernel matmul_kernel compiled SIMD16 allocated 128 regs and spilled around 217
is "spilled" enough for now?
"""

@etiotto (Contributor):
Why a docstring? Can we use a # comment instead?

@alexbaden (Author):
It's a multi-line comment. What's the difference?

@etiotto (Contributor):
In fact it is not a multi-line comment; it is a string literal. The only comment syntax in Python is #. Really a nit, but worth noting.
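To make the distinction concrete: a bare triple-quoted string is an expression statement, not a comment, and only the first one in a body becomes the docstring:

```python
def f():
    """This is the docstring (bound to f.__doc__)."""
    """This is just an unused string-literal expression, evaluated and discarded."""
    # This is the only true comment syntax in Python.
    return 0

print(f.__doc__)  # -> This is the docstring (bound to f.__doc__).
```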

Commits:
- Cache native code 2/?
- Cache native code 3/?
- Cache native code 4/?
- Cache native code 5/?
- Cache native code 6/?: initialize a private sycl context for compilation
- Cache native code 7/?: port the register spills code
- Cache native code 8/?: less verbose timing logs
- Cache native code 9/?: cleanups, fix flag, do some measuring
- Cache native code 10/?: Use ocloc
@alexbaden force-pushed the alex/cache_native_code branch from 5392cc4 to d1dc02e on September 24, 2024
@alexbaden alexbaden merged commit 5b4d89a into main Sep 25, 2024
4 checks passed
@alexbaden alexbaden deleted the alex/cache_native_code branch September 25, 2024 20:40
alexbaden added a commit that referenced this pull request Oct 18, 2024
…ode option (#2391)

Intel Data Center Max GPUs will dynamically scale the number of hardware
threads available per XVE depending on the specified GRF mode. With
small GRF mode (default), a single hardware thread can access 128 GRF
registers and each XVE engine has 8 hardware threads. In large GRF mode,
a single hardware thread can access 256 GRF registers but each XVE
engine only has 4 hardware threads. There is also an auto mode.

([see the docs for more
info](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/small-register-mode-vs-large-register-mode.html))
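
(Note, an inference from the numbers above rather than a quote from the linked docs: the per-XVE register budget is the same in both modes, since 128 regs x 8 threads = 256 regs x 4 threads = 1024 registers; the mode only trades thread-level parallelism for per-thread registers.)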

This PR adds support for populating the `n_regs` parameter returned from
loading a binary with information about the selected GRF mode. Because
Level Zero does not return the number of registers, and our register size
info does not work like NVIDIA's, the semantics are a bit different from
upstream Triton: we _only_ return a value if the user has specified a
small or large GRF mode build flag. The purpose of returning `n_regs` in
upstream Triton/Torch Inductor is that NVIDIA GPUs can dynamically adjust
the occupancy of an SM based on the register pressure per warp. This means
high register pressure can result in fewer running warps, which reduces
parallelism and performance. Theoretically, you can have many different
"GRF modes" on an NVIDIA GPU as you adjust SM occupancy. For Intel GPUs
the choice is binary - large or small - and the performance penalty for
register spills in small mode always outweighs any parallelism gains (at
least in our testing so far). It is not clear that returning 128 is
actionable, as further reductions in register usage will not affect
occupancy - only the large GRF mode affects occupancy. So, I focused on
making sure large GRF mode was properly handled, handled the other cases
as we were able, and returned 0 for any ambiguous case (which causes
Torch Inductor to skip any register-specific optimization).

The approach to returning the GRF size depends on parsing the build
flags passed to the binary loader. Because the build flags are modified
in the `make_spv` step when generating native code instead of a SPIR-V
file, this approach should also work for the native code POC recently
merged in #2148.
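
A rough sketch of that flag parsing (the large-GRF string is the documented Level Zero build option; the small-GRF string and the helper name are assumptions, not necessarily what this PR uses):

```python
def n_regs_from_build_flags(build_flags: str) -> int:
    """Hypothetical mapping from GRF-mode build flags to n_regs."""
    if "-ze-opt-large-register-file" in build_flags:
        return 256  # large GRF mode: 256 registers per thread
    if "-ze-intel-128-GRF-per-thread" in build_flags:  # assumed flag string
        return 128  # small GRF mode: 128 registers per thread
    return 0  # auto/unspecified: ambiguous, Inductor skips the heuristic
```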

Note that I had to introduce exceptions into our `driver.c` code to make
the error handling acceptable. This cleaned up a lot of the code, and I
believe it is acceptable both because we already depend on C++ in
`driver.c` (just not in the external signatures) and because exceptions
are used in other parts of the Triton codebase.

I marked this as a draft PR because I would like to do a bit more
testing, but it is ready for review.

Close #1641
Linked issue: [Research][PyTorch 2.6] Save compiled triton kernel as device binary code