
[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header #18706


Open · wants to merge 22 commits into base: sycl

Conversation

@npmiller (Contributor) commented May 28, 2025

Overview

Currently, to support C++ builtins in SYCL kernels, we rely on libdevice, which
provides implementations for standard library builtins. This library is built
either to bitcode or SPIR-V and linked into our kernels.

On some targets this causes issues because clang sometimes turns standard
library calls into LLVM intrinsics that not all targets support. Specifically, on
NVPTX and AMDGCN we can't easily support these intrinsics because we currently
use implementations provided by CUDA and HIP in the form of a bitcode library,
which is not something the LLVM backend can use.

In upstream LLVM, CUDA and HIP kernels handle this through clang headers that
provide device-side overloads of C++ library functions which hook into the
target-specific versions of the builtins (for example, std::sin to __nv_sin).
This way, on the device side, C++ builtins are intercepted before clang can turn
them into intrinsics, which solves the issue mentioned above.
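
For illustration, such a device-side overload has roughly the following shape (a simplified sketch of the clang CUDA header approach, not a verbatim excerpt; the real headers declare the __nv_* functions separately and cover many more builtins):

  // Simplified sketch (CUDA C++): declare the vendor builtin, then provide a
  // device-side overload of the standard function that forwards to it, so
  // device code never lowers the call to a generic LLVM intrinsic.
  extern "C" __device__ double __nv_sin(double);

  __device__ inline double sin(double __x) { return __nv_sin(__x); }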

This patch adds the infrastructure to handle C++ builtins in SYCL the same way
it is done for CUDA and HIP in upstream LLVM, and uses it to support cmath for
NVPTX and AMDGCN compilation.

Breakdown

  • Add sycl_device_only attribute: this new attribute allows functions marked
    with it to be treated as device-side overloads of existing functions. This is
    what allows us to overload C++ library functions for device in SYCL.
  • Remove the clang hack that prevented generating LLVM intrinsics from standard
    library builtins for NVPTX and AMDGCN. In theory, since this patch only moves
    cmath, the hack could still be needed, but it looks fine in testing, and if we
    run into issues we should just move the problematic builtins to this solution.
    The test sycl-libdevice-cmath.cpp was testing this hack, so it was removed.
  • Disable cmath support for NVPTX and AMDGCN in libdevice. To limit the scope of
    the patch, libdevice is still fully wired up for these targets; it just won't
    provide the cmath functions.
  • Add a cmath-fallback.h header providing the device-side math function
    overloads. They are defined using SPIR-V builtins, so in theory this header
    could be used as-is for other targets (see the sketch after this list).
  • Use our existing cmath stl wrapper to include cmath-fallback.h for NVPTX and
    AMDGCN. In upstream LLVM, clang CUDA always passes the header with these
    overloads via -include; using the stl wrappers is a bit more selective.
  • Add rint to the device lib tests and stl wrapper; it was added in
    [SYCL][Devicelib] Implement cmath rintf wrapper with __spirv_ocl_rint #18857
    but wasn't covered by E2E testing.
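
As a rough illustration of the last few points, an overload in cmath-fallback.h could look something like the sketch below; the exact attribute spelling, macro name, and __spirv_ocl_* declarations are assumptions based on this description, not the actual header contents:

  // Hypothetical sketch of a cmath-fallback.h overload; names and attribute
  // spelling are assumed, not copied from the patch.
  float  __spirv_ocl_sin(float);    // SPIR-V builtin, overloaded by type
  double __spirv_ocl_sin(double);

  #define __CMATH_DEVICE __attribute__((sycl_device_only)) inline

  __CMATH_DEVICE float  sinf(float x) { return __spirv_ocl_sin(x); }
  __CMATH_DEVICE float  sin(float x)  { return __spirv_ocl_sin(x); }
  __CMATH_DEVICE double sin(double x) { return __spirv_ocl_sin(x); }

Because everything goes through the __spirv_ocl_* builtins rather than __nv_* or __ocml_* functions, the same header could in principle serve any target that understands those builtins.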

Compile-time performance

A quick check of compile time shows that this seems to provide a small performance improvement. Using two samples, one using cmath (the E2E cmath_test.cpp) and one not using cmath, over 10 iterations each, I'm getting the following results:

Run                             Mean      Stdev
With patch, cmath sample        4.2229s   0.0294s
With patch, no cmath sample     5.7484s   0.0525s
Without patch, cmath sample     4.3817s   0.0424s
Without patch, no cmath sample  5.7941s   0.0452s

This suggests that the no-cmath compile-time performance is pretty much equivalent, and that the cmath compile-time performance is faster by roughly 0.12s.

And this is with the whole libdevice setup still in place, so it's possible this approach could be even more beneficial with more work.

Future work

  • Investigate the commented-out standard math builtins in cmath-fallback.h;
    these weren't defined in libdevice, so we should either remove the
    commented-out lines or implement them properly.
  • Untangle cmath and math.h; the current cmath-fallback.h implements both,
    which seems to work fine, but ideally we should split it up.
  • Deal with nearbyint; it was only implemented for NVPTX and AMDGCN in
    libdevice, and this patch keeps it that way, but we should look into proper
    support and testing for it.
  • Move more of libdevice into headers (complex, assert, crt, etc.).
  • Try this approach for SPIR-V or other targets.

@npmiller (Contributor, Author) commented:

@bader this is a proof of concept for moving C++ library handling from libdevice code into headers. It allows us to remove the hack blocking LLVM intrinsic generation for standard math built-ins, since we intercept them earlier in the header on the device side, which is in line with what clang CUDA does. For now this only covers cmath, and only for NVIDIA and AMD.

I've currently placed the header in the stl_wrappers directory. It might be better as a clang header, but at least on CUDA the clang header is always included, whereas with the stl wrappers it will only be included when the matching standard library header is included.

This still needs a ton of work, which is why it's a draft, but let me know if you have any feedback on the approach.

It would be good to know whether this would be interesting for non-AOT targets as well. There's a lot of logic in the driver to conditionally link libdevice libraries, and I suspect most of that could in theory be replaced with this header approach, but I haven't looked into it much, so I'm not 100% sure it's something we'd want.

@bader (Contributor) left a comment:


@npmiller, thanks for working on this.

> It allows us to remove the hack blocking LLVM intrinsic generation for standard math built-ins, since we intercept them earlier in the header for device side, which is in-line with what clang cuda does.

I discussed this approach with Johannes Doerfert a few years ago. He told me that he doesn't like "what clang CUDA does" and plans to change it. I think clang still uses the header solution, but it may be worth double-checking whether the LLVM community is doing any work in that direction.

> I've currently placed the header into the stl_wrappers directory, it might be better as a clang header, but at least on CUDA the clang header is always included whereas with the stl wrappers it will only be included when the matching standard library header is included.

Interesting... I thought that clang only adds the path to the clang headers at the beginning of the search-path list to make sure that the clang wrapper header is included before the STL one. I didn't know that the CUDA compiler always includes the clang wrapper headers.

> It would be good to know if this would be interesting for non-AOT targets as well, there's a lot of logic in the driver to conditionally link libdevice libraries, I suspect in theory most of that could be replaced with this header approach, but I haven't looked into this much so I'm not 100% sure if this is something we'd want.

@AlexeySachkov, could you look into the SPIR-V part, please?

The change looks to be aligned with the community approach. The only concern I have is compile time, but the potential increase should be negligible.

cc @Naghasan just to keep in the loop.

@Naghasan (Contributor) left a comment:


don't forget to add tests and documentation for the attribute before undrafting :)

@npmiller (Contributor, Author) commented:

> Interesting... I thought that clang only adds path to the clang headers at the beginning of the search paths list to make sure that clang wrapper header is included before STL one. I didn't know that CUDA compiler always includes clang wrapper headers.

Yeah, in the driver it does:

  CC1Args.push_back("-include");
  CC1Args.push_back("__clang_cuda_runtime_wrapper.h");

And __clang_cuda_cmath.h is included from that runtime wrapper header, which also includes <cmath>.

Using our stl-wrapper solution should allow us to be a little more conservative about when we include all of this.
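
Concretely, the stl-wrapper version of <cmath> could pull the fallback header in only where it is needed, roughly like this (a hypothetical sketch; the patch's actual wrapper may differ):

  // stl_wrappers/cmath, hypothetical sketch: include the device-side overloads
  // only for SYCL device compilation targeting NVPTX or AMDGCN, then defer to
  // the real standard library header.
  #if defined(__SYCL_DEVICE_ONLY__) && (defined(__NVPTX__) || defined(__AMDGCN__))
  #include <cmath-fallback.h>
  #endif
  #include_next <cmath>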

@npmiller npmiller requested a review from bader June 18, 2025 14:20
npmiller added 21 commits June 18, 2025 16:09
This patch experiments with moving standard library math built-ins from
libdevice into headers.

This is based on the way clang handles this for CUDA and HIP. In these
languages you can define device functions as overloads, which allows
redefining standard library functions specifically for the device in a
header, so that we can provide device-specific implementations of certain
built-ins while still using the regular standard library headers.

By default SYCL doesn't do overloads for device functions, so this patch
introduces a new `sycl_device_only` attribute. This attribute makes a
function device-only and allows it to overload existing functions.
We don't support malloc in SYCL; silence warnings for host compilation
with `sycl_device_only`. Fix failing clang test with the new attribute.
This test was relying on the hack preventing LLVM intrinsics from being
emitted, so it doesn't work at all with the new approach.
This doesn't map to a SPIR-V built-in.
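
To make the overloading behaviour described in these commit messages concrete, here is a hedged illustration (the attribute spelling and the __spirv_ocl_sqrt declaration are assumptions, and my_sqrt is a made-up example function): the same signature is defined twice, the plain definition serves host code, and the attributed one is the device-side overload. Without the attribute, the second definition would simply be a redefinition error.

  // Hypothetical example of the sycl_device_only overload behaviour.
  double __spirv_ocl_sqrt(double);                                // assumed builtin decl

  inline double my_sqrt(double x) { return __builtin_sqrt(x); }   // host definition

  __attribute__((sycl_device_only))
  inline double my_sqrt(double x) { return __spirv_ocl_sqrt(x); } // device-side overload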