Add SYCL Kernels for XPU backend #1679

xiaolil1 · 2025-06-15T16:41:53Z

This is the pull request for the SYCL Kernels targeting the XPU backend.

It features the implementation of the "dequantize_blockwise," "dequantize_4bit," and "dequant & gemv_4bit fusion" kernels.
The targeted low-precision quantization data types are NF4, FP4, and General8bits.
This PR aims to eliminate the dependency on IPEX and improve performance.
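
For readers unfamiliar with what a blockwise 4-bit dequantization kernel computes, here is a minimal Python sketch of the idea, not the SYCL code in this PR. It assumes two 4-bit indices packed per byte (high nibble first), one `absmax` scale per block, and a 16-entry lookup table. `CODE_TABLE` is a hypothetical evenly spaced placeholder; the real NF4 and FP4 tables hold different values.

```python
from typing import List, Sequence

# Placeholder 16-entry lookup table, evenly spaced in [-1, 1].
# Assumption for illustration only: the real NF4/FP4 tables differ.
CODE_TABLE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def dequantize_4bit(packed: bytes, absmax: Sequence[float], blocksize: int,
                    table: Sequence[float] = CODE_TABLE) -> List[float]:
    """Unpack two 4-bit indices per byte and rescale each block by its absmax."""
    out: List[float] = []
    for byte in packed:
        # Assumption: the high nibble holds the first element of each pair.
        for index in (byte >> 4, byte & 0x0F):
            scale = absmax[len(out) // blocksize]  # one scale per block
            out.append(table[index] * scale)
    return out


# Two bytes -> four values, with two elements per block.
values = dequantize_4bit(bytes([0x0F, 0xF0]), absmax=[2.0, 4.0], blocksize=2)
print(values)  # [-2.0, 2.0, 4.0, -4.0]
```

A GPU kernel would typically assign each work-item a slice of `packed` instead of looping serially, but the per-element arithmetic is the same lookup and multiply.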

fix transpose

Signed-off-by: jiqing-feng <[email protected]>

revert cpu changes

Signed-off-by: jiqing-feng <[email protected]>

remove check for better performance

Signed-off-by: jiqing-feng <[email protected]>

fix doc

fengyuan14 · 2025-06-18T01:15:37Z

Can we use a more accurate title for the commit? Otherwise, reviewers may be confused about whether all SYCL kernels are included in this PR.

csrc/xpu_kernels.cpp

csrc/xpu_ops.cpp

csrc/xpu_ops.h

Fix xpu check

Signed-off-by: jiqing-feng <[email protected]>

fix device check

Signed-off-by: jiqing-feng <[email protected]>

fix tests

jiqing-feng · 2025-06-24T09:24:13Z

Hi @matthewdouglas. The PR is ready for review. The SYCL kernels achieve a 0-150% speed-up over Triton on 4-bit models. Could you do a first round of review? Please let me know if you have any concerns. Thanks!

xiaolil1 · 2025-06-30T03:29:42Z

> Can we use a more accurate title for the commit? Otherwise, reviewers may be confused about whether all SYCL kernels are included in this PR.

This is the first PR for SYCL kernels targeting QLoRA. I have added a detailed description.

Signed-off-by: jiqing-feng <[email protected]>

fix xpu log

Signed-off-by: jiqing-feng <[email protected]>

Remove ipex entirely

Signed-off-by: jiqing-feng <[email protected]>

fix lint

Egor-Krivov · 2025-07-04T08:47:29Z

@xiaolil1

When I tried to compile it, I had issues with sycl::and_range and sycl::and_item. Are you sure they shouldn't be sycl::nd_range and sycl::nd_item?

https://github.khronos.org/SYCL_Reference/iface/nd_range.html

https://github.khronos.org/SYCL_Reference/iface/nd_item.html

Egor-Krivov · 2025-07-04T09:58:51Z

I replaced the types as described above and tested the implementation.

In my experiment, the SYCL implementation was about 2x faster than Triton for token generation, I guess due to the fused dequant + matmul. The Triton compiler currently has an issue with that: intel/intel-xpu-backend-for-triton#4327.
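
To illustrate why fusing dequantization with the matrix-vector product can help, here is a rough Python sketch under stated assumptions; it is not the kernel from this PR. The unfused version first materializes the whole dequantized weight matrix, while the fused version dequantizes each weight on the fly inside the dot product, so no intermediate matrix is written and read back. `TABLE`, `nibble`, and both function names are hypothetical.

```python
from typing import List, Sequence

# Placeholder lookup table; assumption for illustration, not the NF4/FP4 values.
TABLE = [-1.0 + 2.0 * i / 15 for i in range(16)]


def nibble(packed: bytes, i: int) -> int:
    """Return the i-th 4-bit index (assumed high nibble first)."""
    byte = packed[i // 2]
    return byte >> 4 if i % 2 == 0 else byte & 0x0F


def gemv_unfused(packed: bytes, absmax: Sequence[float], blocksize: int,
                 x: Sequence[float], rows: int, cols: int) -> List[float]:
    # Step 1: materialize the full dequantized weight matrix (row-major).
    w = [TABLE[nibble(packed, i)] * absmax[i // blocksize]
         for i in range(rows * cols)]
    # Step 2: ordinary matrix-vector product over the materialized weights.
    return [sum(w[r * cols + c] * x[c] for c in range(cols))
            for r in range(rows)]


def gemv_fused(packed: bytes, absmax: Sequence[float], blocksize: int,
               x: Sequence[float], rows: int, cols: int) -> List[float]:
    y: List[float] = []
    for r in range(rows):
        acc = 0.0
        for c in range(cols):
            i = r * cols + c
            # Dequantize one weight and consume it immediately.
            acc += TABLE[nibble(packed, i)] * absmax[i // blocksize] * x[c]
        y.append(acc)
    return y


args = (bytes([0x0F, 0xF0]), [2.0, 4.0], 2, [1.0, 3.0], 2, 2)
print(gemv_unfused(*args))  # [4.0, -8.0]
print(gemv_fused(*args))    # [4.0, -8.0]
```

Both versions compute the same result; the difference is in memory traffic, which this sketch does not measure.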

However, some tests failed when running:

BNB_TEST_DEVICE="xpu" pytest -q --tb=short --ignore test_optim.py --ignore test_triton.py --ignore test_cuda_setup_evaluator.py

============================================ FAILURES =============================================
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-uint8-fp16-fc2-nf4-DQ_True-xpu] ________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-bf16-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
________ TestQuantize4BitFunctional.test_gemv_4bit[dim=256-fp32-fp16-fc2-nf4-DQ_True-xpu] _________
test_functional.py:1339: in test_gemv_4bit
    assert relerr1 < 0.0008
E   assert 0.004344199592742371 < 0.0008
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-uint8-bf16-attn_packed-nf4-DQ_False-xpu] ___
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-bf16-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
____ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_True-xpu] ____
test_functional.py:1370: in test_gemv_4bit
    assert maxratio < 1.05 and maxratio > 0.97
E   assert (0.965392252525759 < 1.05 and 0.965392252525759 > 0.97)
___ TestQuantize4BitFunctional.test_gemv_4bit[dim=1024-fp32-bf16-attn_packed-nf4-DQ_False-xpu] ____
test_functional.py:1369: in test_gemv_4bit
    assert relratio < 1.05 and relratio > 0.96
E   assert (0.9500951889140811 < 1.05 and 0.9500951889140811 > 0.96)
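
The assertions in this log compare error metrics against fixed thresholds. As a rough sketch of that kind of check: `mean_relative_error` below is an assumed definition (mean of elementwise |actual - reference| / (|reference| + eps)) and may not match how test_functional.py computes `relerr1`, `relratio`, or `maxratio`. Only the thresholds and the reported values are taken from the log.

```python
from typing import Sequence


def mean_relative_error(actual: Sequence[float], reference: Sequence[float],
                        eps: float = 1e-5) -> float:
    """Mean of |actual - reference| / (|reference| + eps); an assumed definition."""
    total = sum(abs(a - r) / (abs(r) + eps) for a, r in zip(actual, reference))
    return total / len(reference)


def in_band(value: float, low: float, high: float) -> bool:
    """True when low < value < high, mirroring the ratio assertions in the log."""
    return low < value < high


# Values and thresholds reported in the failure log.
print(in_band(0.9500951889140811, 0.96, 1.05))  # False: below the lower bound
print(in_band(0.965392252525759, 0.97, 1.05))   # False: below the lower bound
print(0.004344199592742371 < 0.0008)            # False: about 5.4x the threshold
```

The two ratio failures miss the lower bound by less than 0.01, while the `relerr1` failures exceed their threshold by a much larger factor.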

* fix logs

Signed-off-by: jiqing-feng <[email protected]>

* fix format

Signed-off-by: jiqing-feng <[email protected]>

---------

Signed-off-by: jiqing-feng <[email protected]>

xiaolil1 and others added 19 commits June 15, 2025 16:08

Add SYCL Kernels for XPU backend

dd7b173

Merge pull request #1 from xiaolil1/jiqing

df93cdd

fix transpose

fix transpose

872aa02

Signed-off-by: jiqing-feng <[email protected]>

fix log and format

04437a3

Signed-off-by: jiqing-feng <[email protected]>

revert cpu changes

d585bea

Signed-off-by: jiqing-feng <[email protected]>

clean ipex_xpu

1781611

Signed-off-by: jiqing-feng <[email protected]>

clean ipex import

c982781

Signed-off-by: jiqing-feng <[email protected]>

fix ipex cpu import

a4c5f8c

Signed-off-by: jiqing-feng <[email protected]>

fix typo

4f076bb

Signed-off-by: jiqing-feng <[email protected]>

fix comments

76d7178

Signed-off-by: jiqing-feng <[email protected]>

Merge pull request #2 from xiaolil1/jiqing

b31ea62

revert cpu changes

refine gemv_4bit kernel

452aa84

Merge branch 'main' into main

e8ac8b5

enable FP4 for dequant_4bit and gemv_4bit

8620a95

refine FP4 dequantization performance

00f064b

remove check for better performance

d60750f

Signed-off-by: jiqing-feng <[email protected]>

Merge pull request #3 from xiaolil1/jiqing

59f2aa8

remove check for better performance

fix doc

aad358f

Signed-off-by: jiqing-feng <[email protected]>

Merge pull request #4 from xiaolil1/jiqing

45e4451

fix doc

matthewdouglas added the Low Priority (will be worked on after all priority issues) and Intel labels Jun 17, 2025

matthewdouglas self-assigned this Jun 17, 2025

matthewdouglas self-requested a review June 17, 2025 16:19

matthewdouglas added this to the v0.48.0 milestone Jun 17, 2025