Skip to content

Add SYCL Kernels for XPU backend #1679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

xiaolil1
Copy link

@xiaolil1 xiaolil1 commented Jun 15, 2025

This is the pull request for the SYCL Kernels targeting the XPU backend.

  • It features the implementation of the "dequantize_blockwise," "dequantize_4bit," and "dequant & gemv_4bit fusion" kernels.
  • The target low-precision quantization datatypes encompass NF4, FP4 and General8bits.
  • This PR aims to eliminate the dependency on IPEX and improve the performance.

@matthewdouglas matthewdouglas added Low Priority (will be worked on after all priority issues) Intel labels Jun 17, 2025
@matthewdouglas matthewdouglas self-assigned this Jun 17, 2025
@matthewdouglas matthewdouglas self-requested a review June 17, 2025 16:19
@matthewdouglas matthewdouglas added this to the v0.48.0 milestone Jun 17, 2025
@fengyuan14
Copy link

Can we use a more accurate title for the commit? or reviewers would get confused if all SYCL kernels are included in the PR.

@jiqing-feng
Copy link
Contributor

jiqing-feng commented Jun 24, 2025

Hi @matthewdouglas . The PR is ready to be reviewed. The sycl kernel could get 0-150% speed-up compared to triton on 4bit models. Could you take the 1st round review? Please let me know if you have any concerns. Thanks!

@xiaolil1 xiaolil1 marked this pull request as ready for review June 25, 2025 01:51
@xiaolil1 xiaolil1 changed the title Add SYCL Kernels for XPU backend Add SYCL Kernels for QLoRA XPU backend Jun 27, 2025
@xiaolil1 xiaolil1 changed the title Add SYCL Kernels for QLoRA XPU backend Add SYCL Kernels for XPU backend Jun 27, 2025
@xiaolil1
Copy link
Author

Can we use a more accurate title for the commit? or reviewers would get confused if all SYCL kernels are included in the PR.

This is the first PR for SYCL kernels targeting QLoRA, I have added detailed description.

@Egor-Krivov
Copy link
Contributor

@xiaolil1

When I tried to compile it, I had issues with sycl::and_range and sycl:and_item. Are you sure it's not sycl::nd_range and sycl::nd_item?

https://github.khronos.org/SYCL_Reference/iface/nd_range.html

https://github.khronos.org/SYCL_Reference/iface/nd_item.html

@Egor-Krivov
Copy link
Contributor

I replaced types as described above and tested implementation.

In my experiment SYCL implementation was about 2x faster for token generation than triton. I guess due to fused dequant + matmul. Triton compiler currently have an issue with that: intel/intel-xpu-backend-for-triton#4327.

However, some tests failed BNB_TEST_DEVICE="xpu" pytest -q --tb=short --ignore test_optim.py --ignore test_triton.py --ignore test_cuda_setup_evaluator.py

============================================ FAILURES =============================================
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-uint8-fp16-fc2-nf4-DQ_True-xpu] ________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-bf16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp32-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_False-xpu] ___
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)

* fix logs

Signed-off-by: jiqing-feng <[email protected]>

* fix format

Signed-off-by: jiqing-feng <[email protected]>

---------

Signed-off-by: jiqing-feng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Intel Low Priority (will be worked on after all priority issues)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants