Support native code binary representation for XPU backend #2148

Merged: 7 commits into main, Sep 25, 2024

Conversation

@alexbaden (Contributor) commented Sep 6, 2024

Adds a new command-line flag, TRITON_XPU_GEN_NATIVE_CODE, which enables generating native device code and storing it in the .spv file instead of SPIR-V. To avoid having to access the SYCL runtime inside the compiler, we use ocloc (just as the NVIDIA backend uses ptxas to generate cubin from PTX). But because there is no textual representation of SPIR-V in our pipeline, we do not store the SPIR-V alongside the native code. Originally I had changed the file extension, but decided to stick with .spv for now while we evaluate if/when we want to enable this functionality.
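For illustration, a minimal sketch of what the flag-gated ocloc step could look like. The helper name and temp-file handling are hypothetical; the ocloc arguments match the snippet reviewed below (the real code also passes -options with build flags, omitted here):

```python
import os
import subprocess
import tempfile

def spirv_to_native(spirv: bytes, device: str = "pvc") -> bytes:
    """Hypothetical sketch: compile a SPIR-V module to native code via ocloc."""
    if os.environ.get("TRITON_XPU_GEN_NATIVE_CODE", "0") != "1":
        return spirv  # default path: keep storing SPIR-V in the .spv file
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "kernel.spv")
        with open(src, "wb") as f:
            f.write(spirv)
        fbin = src + ".o"
        # ocloc consumes the SPIR-V input and emits a native binary
        # for the requested device.
        subprocess.run(
            ["ocloc", "compile", "-file", src, "-o", fbin,
             "-spirv_input", "-device", device],
            check=True, capture_output=True,
        )
        with open(fbin, "rb") as f:
            return f.read()  # native code, still written out under .spv
```

Enabling the feature is then just a matter of setting TRITON_XPU_GEN_NATIVE_CODE=1 in the environment.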

In my testing this makes very little difference in back-to-back runs because the driver caches the native code, but the feature was requested for Inductor AOT mode, where the model is exported into a self-contained library.

SPIR-V:
compile eval time 0: 10.648161754
compile eval time 1: 0.013945419
compile eval time 2: 0.012984403
compile eval time 3: 0.012636915
compile eval time 4: 0.0126077
compile eval time 5: 0.012621725
compile eval time 6: 0.012668987
compile eval time 7: 0.012654901
compile eval time 8: 0.01264564
compile eval time 9: 0.013606956

Generated native code:
compile eval time 0: 15.920665989
compile eval time 1: 0.013856625
compile eval time 2: 0.012849391
compile eval time 3: 0.01316768
compile eval time 4: 0.012814131
compile eval time 5: 0.012739703
compile eval time 6: 0.012831038
compile eval time 7: 0.012756367
compile eval time 8: 0.012842068
compile eval time 9: 0.013376863

Close #1792

@pbchekin changed the base branch from llvm-target to main on September 14, 2024
@alexbaden force-pushed the alex/cache_native_code branch from 52f3ad9 to d444315 on September 19, 2024
@alexbaden marked this pull request as ready for review on September 19, 2024

@alexbaden (Author) commented:

Testing under Torch Inductor in AOT mode I got the following timings for loadKernel:

SPIR-V
loadKernel Timing: 54 ms
loadKernel Timing: 62 ms
loadKernel Timing: 58 ms
loadKernel Timing: 62 ms
loadKernel Timing: 60 ms

Native Code
loadKernel Timing: 0 ms
loadKernel Timing: 2 ms
loadKernel Timing: 0 ms
loadKernel Timing: 2 ms
loadKernel Timing: 2 ms

So, by caching the native code we save about 60 ms - but that is for a single kernel, and a typical Inductor model may have 100 or more kernels.

I've made an additional cleanup pass and rebased on main, so I am marking this ready for review. The functionality is behind a flag, so nothing will change once we merge it - but we will need to decide how to handle our binary code generation going forward. Do we want to generate native code rather than SPIR-V by default, with SPIR-V generation under a flag? Or do we want to generate native code only when Inductor is in AOT mode? And if we generate native code instead of SPIR-V, should we change the suffix of the binary file from .spv to something else (.zebin or .xebin)?

@alexbaden force-pushed the alex/cache_native_code branch from d444315 to 0918b70 on September 20, 2024

@alexbaden (Author) commented:

@etiotto @whitneywhtsang this is ready for review. Could one of you please take a look, or suggest a reviewer?

@etiotto etiotto requested a review from a team September 23, 2024 17:43
@etiotto (Contributor) left a comment:

I left some inline comments to address.

In third_party/intel/backend/compiler.py:
fbin = fsrc.name + '.o'

ocloc_cmd = [
'ocloc', 'compile', '-file', fsrc.name, '-o', fbin, '-spirv_input', '-device', 'pvc', '-options',

@etiotto (Contributor):
The device is hardcoded to PVC. I do not think that would work for older devices. Can we choose the device based on the information PyTorch passes to the compiler?

@alexbaden (Author):

Yes, we should. If you are OK with it, I will make this a follow-up task - this is not on by default, so I think the current behavior is acceptable for now.
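One possible shape for that follow-up (hypothetical: the runtime lookup and the name-to-ocloc mapping below are illustrative, not from this PR):

```python
def ocloc_device() -> str:
    """Pick the ocloc -device argument from the runtime device (sketch)."""
    try:
        import torch
        name = torch.xpu.get_device_name().lower()
    except Exception:
        return "pvc"  # fall back to the current hardcoded default
    if "data center gpu max" in name:
        return "pvc"  # Ponte Vecchio
    if "arc" in name:
        return "dg2"  # e.g. Arc A-series; mapping is an assumption
    return "pvc"
```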

gpuAssert(zeKernelGetProperties(l0_kernel, &props));
n_spills = props.spillMemSize;
std::cout << "(I): Kernel has now " << n_spills << " spills" << std::endl;
if (is_spv) {

@etiotto (Contributor):
[nit]: early exit if is_spv is false

@alexbaden (Author):
I could put it in a lambda I suppose... but we still need the return code below the branch, right? I think I am missing something.

@etiotto (Contributor):
We could also outline the GRF handling to a new function.

@alexbaden (Author):
Good idea - I was thinking about something similar. I will put that on my list, alongside returning the GRF size as the number of registers used.

Comment on lines +321 to +325
"""
The exact message is something like:
warning: kernel matmul_kernel compiled SIMD16 allocated 128 regs and spilled around 217
is "spilled" enough for now?
"""

@etiotto (Contributor):
Why a docstring? Can we use a # comment instead?

@alexbaden (Author):
It's a multi-line comment. What's the difference?

@etiotto (Contributor):
In fact it is not a multi-line comment; it is a string literal. The only comment syntax in Python is #. Really a nit, but worth noting.
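To make the distinction concrete: a bare triple-quoted string is an expression statement, not a comment, and only the first one in a body becomes the docstring:

```python
def f():
    """This is the docstring (bound to f.__doc__)."""
    """This is just an unused string-literal expression, evaluated and discarded."""
    # This is the only true comment syntax in Python.
    return 0

print(f.__doc__)  # -> This is the docstring (bound to f.__doc__).
```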

Commits:
- Cache native code 2/?
- Cache native code 3/?
- Cache native code 4/?
- Cache native code 5/?
- Cache native code 6/?: initialize a private sycl context for compilation
- Cache native code 7/?: port the register spills code
- Cache native code 8/?: less verbose timing logs
- Cache native code 9/?: cleanups, fix flag, do some measuring
- Cache native code 10/?: Use ocloc
@alexbaden force-pushed the alex/cache_native_code branch from 5392cc4 to d1dc02e on September 24, 2024
@alexbaden alexbaden merged commit 5b4d89a into main Sep 25, 2024
4 checks passed
@alexbaden alexbaden deleted the alex/cache_native_code branch September 25, 2024 20:40
alexbaden added a commit that referenced this pull request Oct 18, 2024
…ode option (#2391)

Intel Data Center Max GPUs will dynamically scale the number of hardware
threads available per XVE depending on the specified GRF mode. With
small GRF mode (default), a single hardware thread can access 128 GRF
registers and each XVE engine has 8 hardware threads. In large GRF mode,
a single hardware thread can access 256 GRF registers but each XVE
engine only has 4 hardware threads. There is also an auto mode.

([see the docs for more
info](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/small-register-mode-vs-large-register-mode.html))
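
(Note, an inference from the numbers above rather than a quote from the linked docs: the per-XVE register budget is the same in both modes, since 128 regs x 8 threads = 256 regs x 4 threads = 1024 registers; the mode only trades thread-level parallelism for per-thread registers.)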

This PR adds support for populating the `n_regs` parameter returned from
loading a binary with information about the selected GRF mode. Because
Level Zero does not return the number of registers, and our register size
info does not work like NVIDIA's, the semantics are a bit different from
upstream Triton: we _only_ return a value if the user has specified a
small or large GRF mode build flag. The purpose of returning `n_regs` in
upstream Triton/Torch Inductor is that NVIDIA GPUs can dynamically adjust
the occupancy of an SM based on the register pressure per warp. This means
high register pressure can result in fewer running warps, which reduces
parallelism and performance. Theoretically, you can have many different
"GRF modes" on an NVIDIA GPU as you adjust SM occupancy. For Intel GPUs
the choice is binary - large or small - and the performance penalty for
register spills in small mode always outweighs any parallelism gains (at
least in our testing so far). It is not clear that returning 128 is
actionable, as further reductions in register usage will not affect
occupancy - only the large GRF mode affects occupancy. So, I focused on
making sure large GRF mode was properly handled, handled the other cases
as we were able, and returned 0 for any ambiguous case (which causes
Torch Inductor to skip any register-specific optimization).

The approach to returning the GRF size depends on parsing the build
flags passed to the binary loader. Because the build flags are modified
in the `make_spv` step when generating native code instead of a SPIR-V
file, this approach should also work for the native code POC recently
merged in #2148.
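
A rough sketch of that flag parsing (the large-GRF string is the documented Level Zero build option; the small-GRF string and the helper name are assumptions, not necessarily what this PR uses):

```python
def n_regs_from_build_flags(build_flags: str) -> int:
    """Hypothetical mapping from GRF-mode build flags to n_regs."""
    if "-ze-opt-large-register-file" in build_flags:
        return 256  # large GRF mode: 256 registers per thread
    if "-ze-intel-128-GRF-per-thread" in build_flags:  # assumed flag string
        return 128  # small GRF mode: 128 registers per thread
    return 0  # auto/unspecified: ambiguous, Inductor skips the heuristic
```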

Note that I had to introduce exceptions into our `driver.c` code to make
the error handling acceptable. This cleaned up a lot of the code, and I
believe it is acceptable both because we already depend on C++ in
`driver.c` (just not in the external signatures) and because exceptions
are used in other parts of the Triton codebase.

I marked this as a draft PR because I would like to do a bit more
testing, but it is ready for review.

Close #1641
Linked issue: [Research][PyTorch 2.6] Save compiled triton kernel as device binary code