Conversation

0cc4m (Collaborator) commented Jul 27, 2025

Here's an initial version of an Integer Dot mul_mat_vec shader. So far it seems to improve performance with q4_1 and q5_1, but reduce it with q4_0, q5_0 and q8_0. My guess is that this is because of the 32-bit loads in q4_1 and q5_1, while the rest use 16-bit loads.

@jeffbolznv Would you mind taking a look and letting me know if I have any obvious performance issues in the shader?
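
For reference, here is a minimal CPU-side sketch of the per-block math this kind of integer dot path computes for q4_1 weights against q8_1 activations; the struct names and the use of float instead of fp16 scalars are illustrative, not the actual shader code.

```cpp
#include <cstdint>

// One q4_1 block holds 32 weights as 4-bit quants plus a scale d and min m;
// one q8_1 block holds 32 int8 activations plus its scale d and the
// precomputed s = d * sum(qs). All multiply-adds stay in int32; the float
// scales are applied once per block at the end.
struct BlockQ4_1 { float d, m; uint8_t qs[16]; };   // real format: fp16 d/m, 20 bytes
struct BlockQ8_1 { float d, s; int8_t  qs[32]; };   // real format: fp16 d/s, 36 bytes

float block_dot_q4_1_q8_1(const BlockQ4_1 &a, const BlockQ8_1 &b) {
    int32_t sumi = 0;
    for (int j = 0; j < 16; ++j) {
        const int q_lo = a.qs[j] & 0x0F;   // weight j      (unsigned, 0..15)
        const int q_hi = a.qs[j] >> 4;     // weight j + 16
        // in the shader, each group of four of these products maps onto one
        // packed 8-bit integer dot product instruction
        sumi += q_lo * b.qs[j];
        sumi += q_hi * b.qs[j + 16];
    }
    // (d*q + m) dotted with d8*q8, expanded over the block:
    return a.d * b.d * float(sumi) + a.m * b.s;
}
```

For reference, the block sizes are 18 B (q4_0), 20 B (q4_1), 22 B (q5_0), 24 B (q5_1) and 34 B (q8_0), so only the q4_1 and q5_1 block sizes are multiples of 4 bytes and consecutive blocks stay 32-bit aligned, which lines up with the 32-bit vs. 16-bit load observation above.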

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 27, 2025

0cc4m (Collaborator Author) commented Jul 27, 2025

Here are performance results from my tests:

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.01 us/run - 134.48 MFLOP/run - 412.51 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.52 us/run - 134.48 MFLOP/run - 489.87 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    95.15 us/run - 117.44 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.44 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   136.38 us/run - 117.44 MFLOP/run - 861.11 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.87 us/run - 117.44 MFLOP/run - 783.61 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.03 us/run - 117.44 MFLOP/run - 782.80 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   121.87 us/run - 234.88 MFLOP/run -   1.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.40 us/run - 234.88 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   166.30 us/run - 234.88 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   206.09 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.76 us/run - 234.88 MFLOP/run -   1.19 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.56 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4544 runs -   229.63 us/run - 352.32 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5396 runs -   189.94 us/run - 352.32 MFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   259.13 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   258.81 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.43 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3621 runs -   278.23 us/run - 469.76 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   218.20 us/run - 469.76 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   307.29 us/run - 469.76 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   382.97 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4617 runs -   224.90 us/run - 587.20 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3078 runs -   330.95 us/run - 587.20 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4104 runs -   250.29 us/run - 587.20 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   365.23 us/run - 587.20 MFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   452.07 us/run - 587.20 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.45 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   682.41 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   335.38 us/run - 939.52 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   725.50 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   677.66 us/run - 939.52 MFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7371.35 us/run -  60.13 GFLOP/run -   8.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7697.38 us/run -  60.13 GFLOP/run -   7.81 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7584.95 us/run -  60.13 GFLOP/run -   7.93 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7931.54 us/run -  60.13 GFLOP/run -   7.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8015.00 us/run -  60.13 GFLOP/run -   7.50 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.21 us/run - 134.48 MFLOP/run - 412.25 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.08 us/run - 134.48 MFLOP/run - 490.66 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   129.72 us/run - 117.44 MFLOP/run - 905.32 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.43 us/run - 117.44 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.69 us/run - 117.44 MFLOP/run - 754.32 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    83.28 us/run - 117.44 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   216.83 us/run - 117.44 MFLOP/run - 541.62 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   165.83 us/run - 234.88 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.15 us/run - 234.88 MFLOP/run -   3.35 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   200.41 us/run - 234.88 MFLOP/run -   1.17 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    92.60 us/run - 234.88 MFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   232.55 us/run - 234.88 MFLOP/run -   1.01 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.32 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11360 runs -    89.56 us/run - 352.32 MFLOP/run -   3.93 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.72 us/run - 352.32 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   111.35 us/run - 352.32 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   254.72 us/run - 352.32 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5751 runs -   175.38 us/run - 469.76 MFLOP/run -   2.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.33 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4899 runs -   206.11 us/run - 469.76 MFLOP/run -   2.28 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   133.48 us/run - 469.76 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   267.06 us/run - 469.76 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5130 runs -   199.10 us/run - 587.20 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6840 runs -   147.29 us/run - 587.20 MFLOP/run -   3.99 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4446 runs -   228.99 us/run - 587.20 MFLOP/run -   2.56 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   186.59 us/run - 587.20 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3420 runs -   296.54 us/run - 587.20 MFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   205.31 us/run - 939.52 MFLOP/run -   4.58 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7276 runs -   138.46 us/run - 939.52 MFLOP/run -   6.79 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4173 runs -   245.35 us/run - 939.52 MFLOP/run -   3.83 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6313 runs -   160.81 us/run - 939.52 MFLOP/run -   5.84 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3210 runs -   318.22 us/run - 939.52 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7386.12 us/run -  60.13 GFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7693.49 us/run -  60.13 GFLOP/run -   7.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7594.42 us/run -  60.13 GFLOP/run -   7.92 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7918.03 us/run -  60.13 GFLOP/run -   7.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8004.06 us/run -  60.13 GFLOP/run -   7.51 TFLOPS
Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   106.14 us/run - 134.48 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   297.67 us/run - 134.48 MFLOP/run - 451.77 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   147.62 us/run - 117.44 MFLOP/run - 795.55 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   158.42 us/run - 117.44 MFLOP/run - 741.31 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   559.94 us/run - 117.44 MFLOP/run - 209.74 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   198.08 us/run - 117.44 MFLOP/run - 592.89 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   816.05 us/run - 117.44 MFLOP/run - 143.91 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.66 us/run - 234.88 MFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   185.73 us/run - 234.88 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   483.76 us/run - 234.88 MFLOP/run - 485.54 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   201.83 us/run - 234.88 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1278 runs -   953.98 us/run - 234.88 MFLOP/run - 246.21 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   165.98 us/run - 352.32 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   210.20 us/run - 352.32 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1988 runs -   513.99 us/run - 352.32 MFLOP/run - 685.46 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   218.03 us/run - 352.32 MFLOP/run -   1.62 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   648.93 us/run - 352.32 MFLOP/run - 542.93 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.04 us/run - 469.76 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   265.17 us/run - 469.76 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   505.40 us/run - 469.76 MFLOP/run - 929.49 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   258.71 us/run - 469.76 MFLOP/run -   1.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1491 runs -   673.07 us/run - 469.76 MFLOP/run - 697.94 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3249 runs -   308.76 us/run - 587.20 MFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   465.28 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1710 runs -   619.83 us/run - 587.20 MFLOP/run - 947.36 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   477.48 us/run - 587.20 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1197 runs -   931.89 us/run - 587.20 MFLOP/run - 630.12 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3103 runs -   330.52 us/run - 939.52 MFLOP/run -   2.84 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2247 runs -   462.68 us/run - 939.52 MFLOP/run -   2.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   589.40 us/run - 939.52 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2140 runs -   470.27 us/run - 939.52 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                963 runs -  1085.13 us/run - 939.52 MFLOP/run - 865.81 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5539.21 us/run -  60.13 GFLOP/run -  10.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      184 runs -  5460.43 us/run -  60.13 GFLOP/run -  11.01 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5796.34 us/run -  60.13 GFLOP/run -  10.37 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      172 runs -  5816.45 us/run -  60.13 GFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6317.52 us/run -  60.13 GFLOP/run -   9.52 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   105.39 us/run - 134.48 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   300.54 us/run - 134.48 MFLOP/run - 447.46 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   232.85 us/run - 117.44 MFLOP/run - 504.37 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   127.81 us/run - 117.44 MFLOP/run - 918.88 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   252.01 us/run - 117.44 MFLOP/run - 466.01 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.16 us/run - 117.44 MFLOP/run - 766.79 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   253.84 us/run - 117.44 MFLOP/run - 462.65 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   288.94 us/run - 234.88 MFLOP/run - 812.90 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   110.96 us/run - 234.88 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   317.45 us/run - 234.88 MFLOP/run - 739.90 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   135.61 us/run - 234.88 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   264.55 us/run - 234.88 MFLOP/run - 887.85 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   297.55 us/run - 352.32 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   132.35 us/run - 352.32 MFLOP/run -   2.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3124 runs -   339.23 us/run - 352.32 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6532 runs -   154.97 us/run - 352.32 MFLOP/run -   2.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3692 runs -   275.87 us/run - 352.32 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   316.93 us/run - 469.76 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   146.76 us/run - 469.76 MFLOP/run -   3.20 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2982 runs -   352.12 us/run - 469.76 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.20 us/run - 469.76 MFLOP/run -   2.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   305.57 us/run - 469.76 MFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3762 runs -   273.06 us/run - 587.20 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5643 runs -   179.14 us/run - 587.20 MFLOP/run -   3.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   369.60 us/run - 587.20 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   212.93 us/run - 587.20 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   361.02 us/run - 587.20 MFLOP/run -   1.63 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2568 runs -   400.11 us/run - 939.52 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3424 runs -   300.82 us/run - 939.52 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2354 runs -   435.22 us/run - 939.52 MFLOP/run -   2.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.42 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2782 runs -   371.29 us/run - 939.52 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5502.12 us/run -  60.13 GFLOP/run -  10.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5522.41 us/run -  60.13 GFLOP/run -  10.89 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5776.55 us/run -  60.13 GFLOP/run -  10.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      166 runs -  6064.83 us/run -  60.13 GFLOP/run -   9.91 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6308.83 us/run -  60.13 GFLOP/run -   9.53 TFLOPS
Nvidia RTX 3090
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.56 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 7440 runs -   134.50 us/run - 134.48 MFLOP/run - 999.84 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.24 us/run - 117.44 MFLOP/run -   2.38 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    54.12 us/run - 117.44 MFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    69.91 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.77 us/run - 117.44 MFLOP/run -   1.66 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    82.06 us/run - 117.44 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.82 us/run - 234.88 MFLOP/run -   3.80 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    77.28 us/run - 234.88 MFLOP/run -   3.04 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    82.16 us/run - 234.88 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.23 us/run - 234.88 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    95.96 us/run - 234.88 MFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    77.12 us/run - 352.32 MFLOP/run -   4.57 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    96.38 us/run - 352.32 MFLOP/run -   3.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10792 runs -    94.85 us/run - 352.32 MFLOP/run -   3.71 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   112.82 us/run - 352.32 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7952 runs -   126.59 us/run - 352.32 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10863 runs -    93.34 us/run - 469.76 MFLOP/run -   5.03 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.35 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   112.26 us/run - 469.76 MFLOP/run -   4.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7455 runs -   136.60 us/run - 469.76 MFLOP/run -   3.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6603 runs -   156.48 us/run - 469.76 MFLOP/run -   3.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9063 runs -   111.42 us/run - 587.20 MFLOP/run -   5.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7353 runs -   138.83 us/run - 587.20 MFLOP/run -   4.23 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   127.26 us/run - 587.20 MFLOP/run -   4.61 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6498 runs -   156.34 us/run - 587.20 MFLOP/run -   3.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   185.98 us/run - 587.20 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6099 runs -   165.53 us/run - 939.52 MFLOP/run -   5.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   213.55 us/run - 939.52 MFLOP/run -   4.40 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5671 runs -   179.37 us/run - 939.52 MFLOP/run -   5.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   229.11 us/run - 939.52 MFLOP/run -   4.10 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3745 runs -   274.08 us/run - 939.52 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      904 runs -  1108.01 us/run -  60.13 GFLOP/run -  54.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      860 runs -  1164.53 us/run -  60.13 GFLOP/run -  51.63 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1361.15 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1360.98 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      912 runs -  1097.27 us/run -  60.13 GFLOP/run -  54.80 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.68 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 8184 runs -   130.28 us/run - 134.48 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    50.12 us/run - 117.44 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    48.13 us/run - 117.44 MFLOP/run -   2.44 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.03 us/run - 117.44 MFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.74 us/run - 117.44 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.46 us/run - 117.44 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.08 us/run - 234.88 MFLOP/run -   4.99 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.93 us/run - 234.88 MFLOP/run -   4.70 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.08 us/run - 234.88 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.47 us/run - 234.88 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    88.02 us/run - 234.88 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.74 us/run - 352.32 MFLOP/run -   6.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19596 runs -    51.30 us/run - 352.32 MFLOP/run -   6.87 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15904 runs -    63.94 us/run - 352.32 MFLOP/run -   5.51 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16472 runs -    61.01 us/run - 352.32 MFLOP/run -   5.77 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    91.62 us/run - 352.32 MFLOP/run -   3.85 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.33 us/run - 469.76 MFLOP/run -   8.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    57.69 us/run - 469.76 MFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15123 runs -    66.30 us/run - 469.76 MFLOP/run -   7.09 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.62 us/run - 469.76 MFLOP/run -   7.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10437 runs -    97.62 us/run - 469.76 MFLOP/run -   4.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15732 runs -    63.62 us/run - 587.20 MFLOP/run -   9.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.62 us/run - 587.20 MFLOP/run -   9.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.60 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.57 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9576 runs -   104.78 us/run - 587.20 MFLOP/run -   5.60 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12947 runs -    77.25 us/run - 939.52 MFLOP/run -  12.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.66 us/run - 939.52 MFLOP/run -  11.10 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.27 us/run - 939.52 MFLOP/run -  11.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11342 runs -    88.87 us/run - 939.52 MFLOP/run -  10.57 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7597 runs -   133.14 us/run - 939.52 MFLOP/run -   7.06 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      842 runs -  1187.83 us/run -  60.13 GFLOP/run -  50.62 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      784 runs -  1277.27 us/run -  60.13 GFLOP/run -  47.08 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      762 runs -  1313.98 us/run -  60.13 GFLOP/run -  45.76 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      738 runs -  1355.59 us/run -  60.13 GFLOP/run -  44.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      924 runs -  1083.58 us/run -  60.13 GFLOP/run -  55.49 TFLOPS
AMD RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared 

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   145.62 us/run - 134.48 MFLOP/run - 923.47 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20088 runs -    50.37 us/run - 134.48 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.14 us/run - 117.44 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    55.37 us/run - 117.44 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.00 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    74.29 us/run - 117.44 MFLOP/run -   1.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.72 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.98 us/run - 234.88 MFLOP/run -   3.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.87 us/run - 234.88 MFLOP/run -   2.98 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.15 us/run - 234.88 MFLOP/run -   2.73 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.12 us/run - 234.88 MFLOP/run -   2.39 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    89.74 us/run - 234.88 MFLOP/run -   2.62 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    76.56 us/run - 352.32 MFLOP/run -   4.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9940 runs -   102.12 us/run - 352.32 MFLOP/run -   3.45 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -   100.07 us/run - 352.32 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   123.05 us/run - 352.32 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   122.62 us/run - 352.32 MFLOP/run -   2.87 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.78 us/run - 469.76 MFLOP/run -   4.71 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   119.36 us/run - 469.76 MFLOP/run -   3.94 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9159 runs -   110.68 us/run - 469.76 MFLOP/run -   4.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   139.27 us/run - 469.76 MFLOP/run -   3.37 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   167.74 us/run - 469.76 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   128.65 us/run - 587.20 MFLOP/run -   4.56 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   144.22 us/run - 587.20 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6669 runs -   150.20 us/run - 587.20 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   161.58 us/run - 587.20 MFLOP/run -   3.63 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   211.00 us/run - 587.20 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5029 runs -   200.80 us/run - 939.52 MFLOP/run -   4.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   206.88 us/run - 939.52 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4280 runs -   233.96 us/run - 939.52 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4494 runs -   225.62 us/run - 939.52 MFLOP/run -   4.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2675 runs -   386.25 us/run - 939.52 MFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      348 runs -  2882.03 us/run -  60.13 GFLOP/run -  20.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      354 runs -  2837.71 us/run -  60.13 GFLOP/run -  21.19 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      342 runs -  2934.56 us/run -  60.13 GFLOP/run -  20.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      336 runs -  2993.35 us/run -  60.13 GFLOP/run -  20.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      306 runs -  3282.89 us/run -  60.13 GFLOP/run -  18.32 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   142.46 us/run - 134.48 MFLOP/run - 943.97 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20832 runs -    48.66 us/run - 134.48 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    33.86 us/run - 117.44 MFLOP/run -   3.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              36636 runs -    27.51 us/run - 117.44 MFLOP/run -   4.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    41.87 us/run - 117.44 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    34.24 us/run - 117.44 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.41 us/run - 117.44 MFLOP/run -   2.64 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21726 runs -    46.39 us/run - 234.88 MFLOP/run -   5.06 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              35358 runs -    28.40 us/run - 234.88 MFLOP/run -   8.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.68 us/run - 234.88 MFLOP/run -   4.38 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              26412 runs -    38.09 us/run - 234.88 MFLOP/run -   6.17 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.58 us/run - 234.88 MFLOP/run -   4.83 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19028 runs -    52.74 us/run - 352.32 MFLOP/run -   6.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    40.71 us/run - 352.32 MFLOP/run -   8.65 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.88 us/run - 352.32 MFLOP/run -   5.98 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.48 us/run - 352.32 MFLOP/run -   6.98 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18176 runs -    55.56 us/run - 352.32 MFLOP/run -   6.34 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16401 runs -    61.21 us/run - 469.76 MFLOP/run -   7.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.35 us/run - 469.76 MFLOP/run -   9.72 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14271 runs -    70.92 us/run - 469.76 MFLOP/run -   6.62 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.84 us/run - 469.76 MFLOP/run -   7.24 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15762 runs -    63.88 us/run - 469.76 MFLOP/run -   7.35 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.77 us/run - 587.20 MFLOP/run -   9.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14877 runs -    67.57 us/run - 587.20 MFLOP/run -   8.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14022 runs -    71.52 us/run - 587.20 MFLOP/run -   8.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12312 runs -    81.39 us/run - 587.20 MFLOP/run -   7.21 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13680 runs -    73.56 us/run - 587.20 MFLOP/run -   7.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9844 runs -   102.64 us/run - 939.52 MFLOP/run -   9.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11021 runs -    91.60 us/run - 939.52 MFLOP/run -  10.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9202 runs -   108.77 us/run - 939.52 MFLOP/run -   8.64 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9095 runs -   110.57 us/run - 939.52 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10486 runs -    95.77 us/run - 939.52 MFLOP/run -   9.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      362 runs -  2774.96 us/run -  60.13 GFLOP/run -  21.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      356 runs -  2815.14 us/run -  60.13 GFLOP/run -  21.36 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      338 runs -  2968.24 us/run -  60.13 GFLOP/run -  20.26 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      326 runs -  3080.20 us/run -  60.13 GFLOP/run -  19.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      292 runs -  3442.73 us/run -  60.13 GFLOP/run -  17.47 TFLOPS

jeffbolznv (Collaborator) commented:

I did a quick before/after on some Q4_0 models, and it looks like the quantization is pretty expensive:

master:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        365.51 ± 1.33 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        364.74 ± 3.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        236.24 ± 7.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        237.61 ± 1.79 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.41 ± 0.87 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.44 ± 0.15 |

PR:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        340.06 ± 1.73 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        339.06 ± 2.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       224.50 ± 10.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        227.18 ± 1.44 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.65 ± 0.07 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.67 ± 0.11 |

PR with quantize call removed:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        372.26 ± 1.13 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        370.48 ± 3.75 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        242.30 ± 3.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        243.00 ± 1.00 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.49 ± 0.16 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.28 ± 0.14 |

I don't think there's anything particularly wrong with how the quantization is implemented; it's just such a small amount of work that it doesn't fill the GPU, and the 5090 is about the worst case for that. I don't have any great suggestions for what to do about this.
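
To put a rough number on that (the block-to-workgroup mapping here is an assumption, not the actual dispatch): quantizing a single 4096-element activation row to q8_1 only produces 4096 / 32 = 128 blocks, i.e. on the order of a hundred workgroups' worth of work, while a GPU of this class has well over a hundred SMs that each want several resident workgroups to hide latency. The quantize dispatch finishes before the GPU is anywhere near occupied, but its fixed cost still shows up in the tg numbers above.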

0cc4m (Collaborator Author) commented Jul 28, 2025

Yeah, I also see that. We might have to pick a threshold above which using this quantize + integer dot shader path is worth it. Even without further tuning, there are definitely cases where it helps, for example batch 4 and 8 on the RX 6800 XT (a rough sketch of such a threshold check follows the tables below):

Master:

| PP  | TG  | B  | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s     | S t/s  |
| --- | --- | -- | ----- | ------ | -------- | ------ | -------- | ------- | ------ |
| 512 | 512 | 1  | 1024  | 0.372  | 1378.10  | 6.499  | 78.78    | 6.871   | 149.04 |
| 512 | 512 | 2  | 2048  | 0.734  | 1394.93  | 11.341 | 90.29    | 12.075  | 169.60 |
| 512 | 512 | 4  | 4096  | 1.551  | 1320.62  | 18.337 | 111.69   | 19.887  | 205.96 |
| 512 | 512 | 8  | 8192  | 3.499  | 1170.69  | 34.641 | 118.24   | 38.139  | 214.79 |
| 512 | 512 | 16 | 16384 | 8.295  | 987.59   | 59.502 | 137.68   | 67.797  | 241.66 |
| 512 | 512 | 32 | 32768 | 21.548 | 760.35   | 85.820 | 190.91   | 107.368 | 305.19 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 512 | 512 | 1 | 1024 | 0.372 | 1376.71 | 6.980 | 73.35 | 7.352 | 139.28 |
| 512 | 512 | 2 | 2048 | 0.721 | 1420.49 | 11.889 | 86.13 | 12.610 | 162.42 |
| 512 | 512 | 4 | 4096 | 1.562 | 1311.47 | 17.186 | 119.17 | 18.747 | 218.49 |
| 512 | 512 | 8 | 8192 | 3.482 | 1176.48 | 29.917 | 136.91 | 33.398 | 245.28 |
| 512 | 512 | 16 | 16384 | 8.253 | 992.55 | 59.530 | 137.61 | 67.783 | 241.71 |
| 512 | 512 | 32 | 32768 | 21.490 | 762.41 | 85.655 | 191.28 | 107.145 | 305.83 |
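
To make the threshold idea above concrete, here is a minimal sketch of the kind of check it would boil down to; the function name and the cutoff value are hypothetical, not the logic actually used in this PR.

```cpp
// Hypothetical dispatch heuristic: only take the quantize + integer dot
// mul_mat_vec path once the batch is large enough to amortize the extra
// quantize pass. The cutoff would need per-device tuning.
#include <cstdint>

static bool use_int_dot_mmv(uint32_t batch_size, bool device_has_int_dot) {
    const uint32_t min_batch = 4; // illustrative value, roughly where the RX 6800 XT starts to win
    return device_has_int_dot && batch_size >= min_batch;
}
```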

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from fd8be28 to c19ec8f on August 2, 2025 12:15
@0cc4m
Collaborator Author

0cc4m commented Aug 2, 2025

I implemented the q8_1_x4 blocks that align q8_1 to 128 bits. Using them does help a little (there's even an increase for integer dot prompt processing), but the integer dot mmv path is still too slow to enable universally. I'm thinking about ways to use shared memory in the mmv shader, but I'm not sure whether that would help.
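
For reference, here is a rough sketch of why packing four q8_1 blocks restores 128-bit alignment; the struct names and field layout are my guess at the idea, not necessarily the exact structs used by the shaders in this PR.

```cpp
// A plain q8_1 block is 2 + 2 + 32 = 36 bytes, so an array of them drifts off
// 16-byte boundaries and the quants cannot be read as aligned 128-bit words.
// Packing four blocks and grouping the scales up front gives a 144-byte unit:
// 16 bytes of scales followed by 128 bytes of quants, so each block's quant
// row starts on a 128-bit boundary relative to the packed block.
#include <cstdint>

struct block_q8_1 {
    uint16_t d;      // scale (fp16 bit pattern)
    uint16_t s;      // d * sum(qs) (fp16 bit pattern)
    int8_t   qs[32]; // 32 quantized values
};

struct block_q8_1_x4 {
    uint16_t ds[8];   // (d, s) pairs of the four blocks
    int8_t   qs[128]; // 4 * 32 quants
};

static_assert(sizeof(block_q8_1)    == 36,  "unexpected q8_1 block size");
static_assert(sizeof(block_q8_1_x4) == 144, "packed block is a multiple of 16 bytes");
```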

@0cc4m
Collaborator Author

0cc4m commented Aug 2, 2025

Here are some results from the current version:

Nvidia RTX 3090

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 4096 | 512 | 1 | 4608 | 3.900 | 1050.14 | 6.207 | 82.49 | 10.107 | 455.90 |
| 4096 | 512 | 2 | 9216 | 6.033 | 1357.88 | 30.604 | 33.46 | 36.636 | 251.55 |
| 4096 | 512 | 4 | 18432 | 16.503 | 992.79 | 58.499 | 35.01 | 75.002 | 245.75 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 4096 | 512 | 1 | 4608 | 3.912 | 1047.06 | 6.444 | 79.45 | 10.356 | 444.97 |
| 4096 | 512 | 2 | 9216 | 6.079 | 1347.60 | 30.561 | 33.51 | 36.640 | 251.53 |
| 4096 | 512 | 4 | 18432 | 16.582 | 988.07 | 57.161 | 35.83 | 73.743 | 249.95 |

On Nvidia, the batched-bench seems to have an issue with shader compiles slowing down some of the runs.

AMD Radeon RX 6800 XT

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 4096 | 512 | 1 | 4608 | 3.565 | 1149.01 | 8.900 | 57.53 | 12.465 | 369.68 |
| 4096 | 512 | 2 | 9216 | 8.519 | 961.64 | 32.738 | 31.28 | 41.256 | 223.38 |
| 4096 | 512 | 4 | 18432 | 22.255 | 736.19 | 61.596 | 33.25 | 83.851 | 219.82 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 4096 | 512 | 1 | 4608 | 3.225 | 1269.99 | 9.385 | 54.56 | 12.610 | 365.42 |
| 4096 | 512 | 2 | 9216 | 7.859 | 1042.32 | 32.840 | 31.18 | 40.700 | 226.44 |
| 4096 | 512 | 4 | 18432 | 20.895 | 784.09 | 59.938 | 34.17 | 80.833 | 228.03 |

AMD Radeon Pro VII

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 512 | 512 | 1 | 1024 | 0.811 | 631.55 | 8.859 | 57.80 | 9.669 | 105.90 |
| 512 | 512 | 2 | 2048 | 1.539 | 665.54 | 19.551 | 52.38 | 21.090 | 97.11 |
| 512 | 512 | 4 | 4096 | 3.241 | 631.98 | 33.277 | 61.54 | 36.517 | 112.17 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 512 | 512 | 1 | 1024 | 0.805 | 635.87 | 11.381 | 44.99 | 12.186 | 84.03 |
| 512 | 512 | 2 | 2048 | 1.485 | 689.54 | 22.796 | 44.92 | 24.281 | 84.35 |
| 512 | 512 | 4 | 4096 | 3.126 | 655.05 | 33.547 | 61.05 | 36.673 | 111.69 |

Intel A770

Master:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 512 | 512 | 1 | 1024 | 0.702 | 729.48 | 16.858 | 30.37 | 17.560 | 58.31 |
| 512 | 512 | 2 | 2048 | 1.495 | 685.10 | 30.323 | 33.77 | 31.818 | 64.37 |
| 512 | 512 | 4 | 4096 | 3.360 | 609.61 | 48.322 | 42.38 | 51.681 | 79.26 |

PR:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | --: | ---: | -----: | -------: | -----: | -------: | ----: | ----: |
| 512 | 512 | 1 | 1024 | 0.607 | 843.46 | 20.431 | 25.06 | 21.038 | 48.67 |
| 512 | 512 | 2 | 2048 | 1.306 | 783.96 | 35.346 | 28.97 | 36.652 | 55.88 |
| 512 | 512 | 4 | 4096 | 2.971 | 689.24 | 53.052 | 38.60 | 56.024 | 73.11 |

@0cc4m
Collaborator Author

0cc4m commented Aug 3, 2025

Here are some new results; performance is looking better now, even for small models. Selecting when to enable this path and when not to is still tricky, though.

@jeffbolznv Can you retest on your worst-case 5090? On my 3090 it looks like enabling this path on Nvidia may be worth it on Q4_1 and Q5_1, since they perform best due to 16B alignment. If you see further optimization opportunities, let me know.

Nvidia RTX 3090 (without coopmat1/2)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------ | ---: | -----: | ------- | --: | ---: | -----------: | -------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 9194.72 ± 323.10 | 8926.13 ± 203.39 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 324.21 ± 56.21 | 311.07 ± 51.04 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 9189.23 ± 148.56 | 9296.94 ± 194.27 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 336.64 ± 10.50 | 327.20 ± 0.56 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 8678.07 ± 32.60 | 9060.36 ± 21.48 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 304.93 ± 5.38 | 310.19 ± 4.96 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 8807.90 ± 204.32 | 9108.72 ± 30.17 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 303.30 ± 3.87 | 292.32 ± 0.86 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 9058.35 ± 32.73 | 9101.32 ± 23.69 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 288.87 ± 2.46 | 267.09 ± 2.26 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 1912.84 ± 15.65 | 1924.58 ± 9.18 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 107.09 ± 0.18 | 107.85 ± 0.75 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 1856.80 ± 10.88 | 1898.31 ± 9.23 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 101.17 ± 0.30 | 108.18 ± 0.15 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 1884.42 ± 11.34 | 1898.32 ± 8.53 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 75.10 ± 0.15 | 74.57 ± 0.11 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.75 us/run - 117.44 MFLOP/run -   2.36 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    54.75 us/run - 117.44 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    71.58 us/run - 117.44 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    72.32 us/run - 117.44 MFLOP/run -   1.62 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    82.89 us/run - 117.44 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.95 us/run - 234.88 MFLOP/run -   3.73 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.99 us/run - 234.88 MFLOP/run -   2.97 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    83.76 us/run - 234.88 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    96.60 us/run - 234.88 MFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.03 us/run - 234.88 MFLOP/run -   2.40 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.66 us/run - 352.32 MFLOP/run -   4.48 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.22 us/run - 352.32 MFLOP/run -   3.59 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    96.41 us/run - 352.32 MFLOP/run -   3.65 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8804 runs -   114.84 us/run - 352.32 MFLOP/run -   3.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7952 runs -   128.65 us/run - 352.32 MFLOP/run -   2.74 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.40 us/run - 469.76 MFLOP/run -   4.98 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   117.56 us/run - 469.76 MFLOP/run -   4.00 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   113.66 us/run - 469.76 MFLOP/run -   4.13 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   138.49 us/run - 469.76 MFLOP/run -   3.39 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   158.86 us/run - 469.76 MFLOP/run -   2.96 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8892 runs -   113.12 us/run - 587.20 MFLOP/run -   5.19 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7182 runs -   141.11 us/run - 587.20 MFLOP/run -   4.16 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   129.28 us/run - 587.20 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   158.58 us/run - 587.20 MFLOP/run -   3.70 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5301 runs -   189.32 us/run - 587.20 MFLOP/run -   3.10 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5992 runs -   166.94 us/run - 939.52 MFLOP/run -   5.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   216.16 us/run - 939.52 MFLOP/run -   4.35 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5564 runs -   181.81 us/run - 939.52 MFLOP/run -   5.17 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   232.83 us/run - 939.52 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3638 runs -   277.63 us/run - 939.52 MFLOP/run -   3.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      452 runs -  2216.07 us/run -  60.13 GFLOP/run -  27.13 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      436 runs -  2294.29 us/run -  60.13 GFLOP/run -  26.21 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      430 runs -  2333.07 us/run -  60.13 GFLOP/run -  25.77 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      426 runs -  2354.50 us/run -  60.13 GFLOP/run -  25.54 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      450 runs -  2224.99 us/run -  60.13 GFLOP/run -  27.02 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.03 us/run - 117.44 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.81 us/run - 117.44 MFLOP/run -   2.46 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.90 us/run - 117.44 MFLOP/run -   2.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.25 us/run - 117.44 MFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    79.10 us/run - 117.44 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.81 us/run - 234.88 MFLOP/run -   4.91 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20022 runs -    50.45 us/run - 234.88 MFLOP/run -   4.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.73 us/run - 234.88 MFLOP/run -   4.14 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    59.36 us/run - 234.88 MFLOP/run -   3.96 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    81.78 us/run - 234.88 MFLOP/run -   2.87 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19312 runs -    52.29 us/run - 352.32 MFLOP/run -   6.74 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.05 us/run - 352.32 MFLOP/run -   6.29 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.64 us/run - 352.32 MFLOP/run -   5.62 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15620 runs -    64.26 us/run - 352.32 MFLOP/run -   5.48 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    85.25 us/run - 352.32 MFLOP/run -   4.13 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16827 runs -    59.72 us/run - 469.76 MFLOP/run -   7.87 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16401 runs -    61.13 us/run - 469.76 MFLOP/run -   7.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14271 runs -    70.17 us/run - 469.76 MFLOP/run -   6.69 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    69.48 us/run - 469.76 MFLOP/run -   6.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    91.45 us/run - 469.76 MFLOP/run -   5.14 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14877 runs -    67.93 us/run - 587.20 MFLOP/run -   8.64 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14193 runs -    71.19 us/run - 587.20 MFLOP/run -   8.25 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12825 runs -    78.98 us/run - 587.20 MFLOP/run -   7.43 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    77.27 us/run - 587.20 MFLOP/run -   7.60 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9747 runs -   104.15 us/run - 587.20 MFLOP/run -   5.64 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10486 runs -    95.93 us/run - 939.52 MFLOP/run -   9.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10058 runs -    99.53 us/run - 939.52 MFLOP/run -   9.44 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9309 runs -   107.87 us/run - 939.52 MFLOP/run -   8.71 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9416 runs -   106.53 us/run - 939.52 MFLOP/run -   8.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6741 runs -   150.02 us/run - 939.52 MFLOP/run -   6.26 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      456 runs -  2200.07 us/run -  60.13 GFLOP/run -  27.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      450 runs -  2224.65 us/run -  60.13 GFLOP/run -  27.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      428 runs -  2342.33 us/run -  60.13 GFLOP/run -  25.67 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      438 runs -  2284.31 us/run -  60.13 GFLOP/run -  26.32 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      456 runs -  2196.48 us/run -  60.13 GFLOP/run -  27.38 TFLOPS
AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------ | ---: | -----: | ------- | --: | ---: | -----------: | -------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 8290.44 ± 68.15 | 9310.68 ± 184.78 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 361.90 ± 0.40 | 346.92 ± 2.04 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 7996.84 ± 80.20 | 9143.00 ± 181.95 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 332.41 ± 0.99 | 333.82 ± 0.45 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 7788.58 ± 40.65 | 8975.54 ± 158.83 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 313.75 ± 0.63 | 317.04 ± 0.36 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 7645.57 ± 60.32 | 8879.91 ± 246.04 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 275.61 ± 0.57 | 299.83 ± 4.70 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 7207.73 ± 13.38 | 8179.54 ± 114.85 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 265.07 ± 0.14 | 249.98 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 1460.24 ± 0.64 | 1651.88 ± 2.28 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 85.71 ± 0.05 | 86.02 ± 0.01 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 1421.29 ± 2.25 | 1602.68 ± 2.05 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 77.33 ± 0.15 | 79.11 ± 0.02 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 1247.77 ± 0.86 | 1391.76 ± 1.91 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 54.01 ± 0.06 | 53.82 ± 0.01 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    45.06 us/run - 117.44 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.43 us/run - 117.44 MFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15336 runs -    67.74 us/run - 117.44 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    71.62 us/run - 117.44 MFLOP/run -   1.64 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.38 us/run - 117.44 MFLOP/run -   2.08 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    59.68 us/run - 234.88 MFLOP/run -   3.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    75.97 us/run - 234.88 MFLOP/run -   3.09 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    83.24 us/run - 234.88 MFLOP/run -   2.82 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.06 us/run - 234.88 MFLOP/run -   2.50 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.19 us/run - 234.88 MFLOP/run -   2.73 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    73.55 us/run - 352.32 MFLOP/run -   4.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.13 us/run - 352.32 MFLOP/run -   3.55 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    97.12 us/run - 352.32 MFLOP/run -   3.63 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8804 runs -   116.30 us/run - 352.32 MFLOP/run -   3.03 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   119.58 us/run - 352.32 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.28 us/run - 469.76 MFLOP/run -   4.73 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8307 runs -   120.53 us/run - 469.76 MFLOP/run -   3.90 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9159 runs -   110.02 us/run - 469.76 MFLOP/run -   4.27 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   138.17 us/run - 469.76 MFLOP/run -   3.40 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   168.71 us/run - 469.76 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   129.20 us/run - 587.20 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   143.02 us/run - 587.20 MFLOP/run -   4.11 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6669 runs -   150.73 us/run - 587.20 MFLOP/run -   3.90 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   160.34 us/run - 587.20 MFLOP/run -   3.66 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   211.90 us/run - 587.20 MFLOP/run -   2.77 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5029 runs -   200.66 us/run - 939.52 MFLOP/run -   4.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4815 runs -   208.91 us/run - 939.52 MFLOP/run -   4.50 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4173 runs -   240.01 us/run - 939.52 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   230.94 us/run - 939.52 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2782 runs -   373.07 us/run - 939.52 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      354 runs -  2832.15 us/run -  60.13 GFLOP/run -  21.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      346 runs -  2893.23 us/run -  60.13 GFLOP/run -  20.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      338 runs -  2970.86 us/run -  60.13 GFLOP/run -  20.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      332 runs -  3017.38 us/run -  60.13 GFLOP/run -  19.93 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      304 runs -  3298.77 us/run -  60.13 GFLOP/run -  18.23 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              33228 runs -    30.60 us/run - 117.44 MFLOP/run -   3.84 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              34080 runs -    30.07 us/run - 117.44 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.67 us/run - 117.44 MFLOP/run -   2.63 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25560 runs -    40.09 us/run - 117.44 MFLOP/run -   2.93 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              26412 runs -    38.47 us/run - 117.44 MFLOP/run -   3.05 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23856 runs -    41.98 us/run - 234.88 MFLOP/run -   5.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23856 runs -    42.50 us/run - 234.88 MFLOP/run -   5.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.52 us/run - 234.88 MFLOP/run -   4.01 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19596 runs -    52.02 us/run - 234.88 MFLOP/run -   4.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.66 us/run - 234.88 MFLOP/run -   4.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19028 runs -    53.15 us/run - 352.32 MFLOP/run -   6.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.06 us/run - 352.32 MFLOP/run -   6.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15052 runs -    67.57 us/run - 352.32 MFLOP/run -   5.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15904 runs -    63.25 us/run - 352.32 MFLOP/run -   5.57 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.08 us/run - 352.32 MFLOP/run -   5.03 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15336 runs -    65.73 us/run - 469.76 MFLOP/run -   7.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.68 us/run - 469.76 MFLOP/run -   7.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.39 us/run - 469.76 MFLOP/run -   5.99 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    76.58 us/run - 469.76 MFLOP/run -   6.13 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    84.79 us/run - 469.76 MFLOP/run -   5.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    76.95 us/run - 587.20 MFLOP/run -   7.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12996 runs -    77.88 us/run - 587.20 MFLOP/run -   7.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11115 runs -    90.56 us/run - 587.20 MFLOP/run -   6.48 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11457 runs -    87.54 us/run - 587.20 MFLOP/run -   6.71 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9918 runs -   101.76 us/run - 587.20 MFLOP/run -   5.77 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8881 runs -   112.68 us/run - 939.52 MFLOP/run -   8.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8774 runs -   114.75 us/run - 939.52 MFLOP/run -   8.19 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7918 runs -   127.01 us/run - 939.52 MFLOP/run -   7.40 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7918 runs -   126.95 us/run - 939.52 MFLOP/run -   7.40 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6741 runs -   149.44 us/run - 939.52 MFLOP/run -   6.29 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      424 runs -  2361.22 us/run -  60.13 GFLOP/run -  25.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      412 runs -  2433.19 us/run -  60.13 GFLOP/run -  24.71 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      406 runs -  2469.37 us/run -  60.13 GFLOP/run -  24.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      370 runs -  2706.55 us/run -  60.13 GFLOP/run -  22.22 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      346 runs -  2890.54 us/run -  60.13 GFLOP/run -  20.80 TFLOPS
AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------ | ---: | -----: | ------- | --: | ---: | -----------: | -------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 4319.68 ± 10.73 | 4166.32 ± 27.82 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 237.71 ± 7.57 | 225.20 ± 10.54 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 3846.38 ± 24.25 | 3821.60 ± 8.56 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 188.97 ± 0.75 | 246.49 ± 2.26 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 4089.88 ± 14.79 | 3985.56 ± 15.74 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 185.73 ± 1.35 | 199.39 ± 7.51 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 3694.51 ± 13.06 | 3686.57 ± 14.67 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 175.89 ± 0.34 | 225.69 ± 2.53 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 3858.94 ± 10.64 | 3830.51 ± 13.83 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 190.62 ± 1.98 | 198.15 ± 2.20 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 681.63 ± 0.85 | 708.55 ± 0.63 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 64.35 ± 0.59 | 72.14 ± 0.92 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 617.22 ± 0.26 | 651.25 ± 0.74 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 55.57 ± 0.09 | 83.98 ± 1.04 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 614.20 ± 0.14 | 633.82 ± 0.79 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 45.15 ± 0.66 | 50.87 ± 0.19 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    94.62 us/run - 117.44 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.40 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   136.06 us/run - 117.44 MFLOP/run - 863.17 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.48 us/run - 117.44 MFLOP/run - 785.65 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.81 us/run - 117.44 MFLOP/run - 783.92 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   121.36 us/run - 234.88 MFLOP/run -   1.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   180.98 us/run - 234.88 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   165.74 us/run - 234.88 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   205.79 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.30 us/run - 234.88 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.00 us/run - 352.32 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4544 runs -   229.42 us/run - 352.32 MFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5396 runs -   189.78 us/run - 352.32 MFLOP/run -   1.86 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   259.73 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   258.44 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   185.89 us/run - 469.76 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3621 runs -   278.23 us/run - 469.76 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   218.55 us/run - 469.76 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   308.20 us/run - 469.76 MFLOP/run -   1.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   383.07 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4446 runs -   225.21 us/run - 587.20 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3078 runs -   330.66 us/run - 587.20 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4104 runs -   250.37 us/run - 587.20 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   366.23 us/run - 587.20 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   453.64 us/run - 587.20 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.33 us/run - 939.52 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   682.17 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   336.23 us/run - 939.52 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   724.83 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   680.78 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7389.38 us/run -  60.13 GFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7842.16 us/run -  60.13 GFLOP/run -   7.67 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7683.52 us/run -  60.13 GFLOP/run -   7.83 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  7996.14 us/run -  60.13 GFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      124 runs -  8115.40 us/run -  60.13 GFLOP/run -   7.41 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    73.72 us/run - 117.44 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.84 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   111.69 us/run - 117.44 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    80.30 us/run - 117.44 MFLOP/run -   1.46 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   128.43 us/run - 117.44 MFLOP/run - 914.40 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    84.72 us/run - 234.88 MFLOP/run -   2.77 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.82 us/run - 234.88 MFLOP/run -   3.32 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8094 runs -   124.44 us/run - 234.88 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    95.75 us/run - 234.88 MFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   148.05 us/run - 234.88 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    97.60 us/run - 352.32 MFLOP/run -   3.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11360 runs -    88.90 us/run - 352.32 MFLOP/run -   3.96 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7384 runs -   137.00 us/run - 352.32 MFLOP/run -   2.57 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   112.16 us/run - 352.32 MFLOP/run -   3.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   167.61 us/run - 352.32 MFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   113.21 us/run - 469.76 MFLOP/run -   4.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9798 runs -   104.24 us/run - 469.76 MFLOP/run -   4.51 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6603 runs -   153.85 us/run - 469.76 MFLOP/run -   3.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7881 runs -   130.09 us/run - 469.76 MFLOP/run -   3.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.63 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8037 runs -   126.66 us/run - 587.20 MFLOP/run -   4.64 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8379 runs -   120.66 us/run - 587.20 MFLOP/run -   4.87 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6156 runs -   166.61 us/run - 587.20 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   143.56 us/run - 587.20 MFLOP/run -   4.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4959 runs -   205.72 us/run - 587.20 MFLOP/run -   2.85 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5778 runs -   174.42 us/run - 939.52 MFLOP/run -   5.39 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5778 runs -   175.33 us/run - 939.52 MFLOP/run -   5.36 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4815 runs -   208.77 us/run - 939.52 MFLOP/run -   4.50 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5136 runs -   197.36 us/run - 939.52 MFLOP/run -   4.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3745 runs -   269.83 us/run - 939.52 MFLOP/run -   3.48 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7794.81 us/run -  60.13 GFLOP/run -   7.71 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      122 runs -  8280.09 us/run -  60.13 GFLOP/run -   7.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8040.53 us/run -  60.13 GFLOP/run -   7.48 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      120 runs -  8455.30 us/run -  60.13 GFLOP/run -   7.11 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      118 runs -  8612.27 us/run -  60.13 GFLOP/run -   6.98 TFLOPS
Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s (Master) | t/s (PR) |
| ------ | ---: | -----: | ------- | --: | ---: | -----------: | -------: |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | pp512 | 3848.99 ± 230.91 | 4224.15 ± 272.97 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | Vulkan | 99 | tg128 | 116.43 ± 1.40 | 120.02 ± 0.12 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | pp512 | 3844.84 ± 230.45 | 4211.68 ± 262.99 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | Vulkan | 99 | tg128 | 115.23 ± 2.04 | 132.14 ± 0.05 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | pp512 | 3700.45 ± 205.58 | 4016.62 ± 236.41 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | Vulkan | 99 | tg128 | 58.72 ± 0.07 | 78.30 ± 0.09 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | pp512 | 3730.14 ± 212.64 | 4073.48 ± 250.03 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | Vulkan | 99 | tg128 | 102.79 ± 0.14 | 117.86 ± 0.04 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | pp512 | 3551.92 ± 194.54 | 3872.60 ± 232.64 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Vulkan | 99 | tg128 | 53.32 ± 0.11 | 120.84 ± 0.04 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 739.69 ± 0.93 | 843.71 ± 0.86 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 32.68 ± 0.03 | 33.73 ± 0.04 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | pp512 | 740.14 ± 1.95 | 839.24 ± 1.23 |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | tg128 | 32.51 ± 0.05 | 41.05 ± 0.01 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 657.79 ± 1.01 | 737.46 ± 1.02 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | tg128 | 9.85 ± 0.00 | 32.37 ± 0.05 |

Master:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.17 us/run - 117.44 MFLOP/run - 787.29 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.80 us/run - 117.44 MFLOP/run - 763.62 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   554.93 us/run - 117.44 MFLOP/run - 211.63 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   198.64 us/run - 117.44 MFLOP/run - 591.24 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   819.78 us/run - 117.44 MFLOP/run - 143.26 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.23 us/run - 234.88 MFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.86 us/run - 234.88 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   482.87 us/run - 234.88 MFLOP/run - 486.43 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   203.09 us/run - 234.88 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1278 runs -   963.53 us/run - 234.88 MFLOP/run - 243.77 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   165.31 us/run - 352.32 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   207.60 us/run - 352.32 MFLOP/run -   1.70 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1988 runs -   515.32 us/run - 352.32 MFLOP/run - 683.69 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   218.91 us/run - 352.32 MFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   652.91 us/run - 352.32 MFLOP/run - 539.61 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.39 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   253.83 us/run - 469.76 MFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   507.43 us/run - 469.76 MFLOP/run - 925.76 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   249.59 us/run - 469.76 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1491 runs -   673.97 us/run - 469.76 MFLOP/run - 697.00 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3420 runs -   305.60 us/run - 587.20 MFLOP/run -   1.92 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2394 runs -   446.38 us/run - 587.20 MFLOP/run -   1.32 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1710 runs -   623.58 us/run - 587.20 MFLOP/run - 941.67 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   477.75 us/run - 587.20 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1197 runs -   924.28 us/run - 587.20 MFLOP/run - 635.31 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3103 runs -   326.34 us/run - 939.52 MFLOP/run -   2.88 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2140 runs -   473.78 us/run - 939.52 MFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   585.79 us/run - 939.52 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2247 runs -   453.44 us/run - 939.52 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                963 runs -  1087.41 us/run - 939.52 MFLOP/run - 864.00 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      172 runs -  5838.30 us/run -  60.13 GFLOP/run -  10.30 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5495.88 us/run -  60.13 GFLOP/run -  10.94 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      176 runs -  5734.22 us/run -  60.13 GFLOP/run -  10.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      162 runs -  6229.44 us/run -  60.13 GFLOP/run -   9.65 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6299.02 us/run -  60.13 GFLOP/run -   9.55 TFLOPS

PR:
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   168.10 us/run - 117.44 MFLOP/run - 698.62 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.36 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   404.84 us/run - 117.44 MFLOP/run - 290.09 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.53 us/run - 117.44 MFLOP/run - 764.95 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   190.12 us/run - 117.44 MFLOP/run - 617.70 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   172.44 us/run - 234.88 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   146.98 us/run - 234.88 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2982 runs -   339.21 us/run - 234.88 MFLOP/run - 692.43 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   206.71 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   211.92 us/run - 234.88 MFLOP/run -   1.11 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3692 runs -   277.84 us/run - 352.32 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6532 runs -   155.90 us/run - 352.32 MFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2272 runs -   448.46 us/run - 352.32 MFLOP/run - 785.62 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2840 runs -   361.84 us/run - 352.32 MFLOP/run - 973.69 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   264.50 us/run - 352.32 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   332.54 us/run - 469.76 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   324.34 us/run - 469.76 MFLOP/run -   1.45 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   484.14 us/run - 469.76 MFLOP/run - 970.31 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   378.15 us/run - 469.76 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   267.34 us/run - 469.76 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   366.77 us/run - 587.20 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2565 runs -   392.54 us/run - 587.20 MFLOP/run -   1.50 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1881 runs -   558.01 us/run - 587.20 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2052 runs -   511.96 us/run - 587.20 MFLOP/run -   1.15 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   350.20 us/run - 587.20 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1605 runs -   639.74 us/run - 939.52 MFLOP/run -   1.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   706.44 us/run - 939.52 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1284 runs -   817.76 us/run - 939.52 MFLOP/run -   1.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   750.15 us/run - 939.52 MFLOP/run -   1.25 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   590.37 us/run - 939.52 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      222 runs -  4508.30 us/run -  60.13 GFLOP/run -  13.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      216 runs -  4641.71 us/run -  60.13 GFLOP/run -  12.95 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      202 runs -  4961.32 us/run -  60.13 GFLOP/run -  12.12 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      208 runs -  4838.35 us/run -  60.13 GFLOP/run -  12.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      186 runs -  5414.21 us/run -  60.13 GFLOP/run -  11.11 TFLOPS

@jeffbolznv
Collaborator

Some quick results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_1.gguf -m c:\models\Meta-Llama-3-8B.Q5_0.gguf -m c:\models\Meta-Llama-3-8B.Q5_1.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        237.90 ± 0.57 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        218.49 ± 3.46 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        188.38 ± 6.95 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        171.80 ± 3.68 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        161.96 ± 2.28 |

build: 6c7a4411 (6076)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_0.gguf -m c:\models\Meta-Llama-3-8B.Q4_1.gguf -m c:\models\Meta-Llama-3-8B.Q5_0.gguf -m c:\models\Meta-Llama-3-8B.Q5_1.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        224.81 ± 0.64 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        205.56 ± 7.02 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        191.34 ± 5.13 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        176.42 ± 4.35 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        172.84 ± 5.79 |

build: 32585e7c (6072)

before:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                16368 runs -    63.20 us/run - 134.48 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                14136 runs -    70.81 us/run - 134.48 MFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              62196 runs -    16.20 us/run - 117.44 MFLOP/run -   7.25 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56232 runs -    17.80 us/run - 117.44 MFLOP/run -   6.60 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37488 runs -    27.04 us/run - 117.44 MFLOP/run -   4.34 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              36636 runs -    27.44 us/run - 117.44 MFLOP/run -   4.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              49416 runs -    20.55 us/run - 117.44 MFLOP/run -   5.72 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              48564 runs -    20.60 us/run - 234.88 MFLOP/run -  11.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              44304 runs -    22.61 us/run - 234.88 MFLOP/run -  10.39 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32376 runs -    31.23 us/run - 234.88 MFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32802 runs -    30.56 us/run - 234.88 MFLOP/run -   7.69 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              32376 runs -    31.24 us/run - 234.88 MFLOP/run -   7.52 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37772 runs -    26.58 us/run - 352.32 MFLOP/run -  13.26 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              30956 runs -    32.55 us/run - 352.32 MFLOP/run -  10.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25844 runs -    38.75 us/run - 352.32 MFLOP/run -   9.09 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24140 runs -    41.79 us/run - 352.32 MFLOP/run -   8.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21868 runs -    46.05 us/run - 352.32 MFLOP/run -   7.65 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    33.66 us/run - 469.76 MFLOP/run -  13.96 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25347 runs -    39.74 us/run - 469.76 MFLOP/run -  11.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              22152 runs -    45.45 us/run - 469.76 MFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.12 us/run - 469.76 MFLOP/run -   9.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18318 runs -    55.05 us/run - 469.76 MFLOP/run -   8.53 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24111 runs -    41.58 us/run - 587.20 MFLOP/run -  14.12 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20007 runs -    50.41 us/run - 587.20 MFLOP/run -  11.65 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19152 runs -    52.62 us/run - 587.20 MFLOP/run -  11.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17613 runs -    57.09 us/run - 587.20 MFLOP/run -  10.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14706 runs -    68.69 us/run - 587.20 MFLOP/run -   8.55 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12305 runs -    81.67 us/run - 939.52 MFLOP/run -  11.50 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12305 runs -    81.49 us/run - 939.52 MFLOP/run -  11.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10914 runs -    92.02 us/run - 939.52 MFLOP/run -  10.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11021 runs -    91.32 us/run - 939.52 MFLOP/run -  10.29 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   216.60 us/run - 939.52 MFLOP/run -   4.34 TFLOPS
  
after:

  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                16368 runs -    63.43 us/run - 134.48 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                14136 runs -    71.03 us/run - 134.48 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              88608 runs -    11.36 us/run - 117.44 MFLOP/run -  10.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              88608 runs -    11.36 us/run - 117.44 MFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              72420 runs -    13.82 us/run - 117.44 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              77532 runs -    13.03 us/run - 117.44 MFLOP/run -   9.02 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              62196 runs -    16.25 us/run - 117.44 MFLOP/run -   7.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              73272 runs -    13.68 us/run - 234.88 MFLOP/run -  17.17 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              70716 runs -    14.22 us/run - 234.88 MFLOP/run -  16.51 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56658 runs -    17.65 us/run - 234.88 MFLOP/run -  13.31 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              65178 runs -    15.37 us/run - 234.88 MFLOP/run -  15.28 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              49416 runs -    20.36 us/run - 234.88 MFLOP/run -  11.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              55096 runs -    18.23 us/run - 352.32 MFLOP/run -  19.33 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              56232 runs -    17.87 us/run - 352.32 MFLOP/run -  19.72 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              48848 runs -    20.53 us/run - 352.32 MFLOP/run -  17.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              51688 runs -    19.39 us/run - 352.32 MFLOP/run -  18.17 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              39760 runs -    25.21 us/run - 352.32 MFLOP/run -  13.97 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              47712 runs -    20.97 us/run - 469.76 MFLOP/run -  22.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              46221 runs -    21.67 us/run - 469.76 MFLOP/run -  21.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              42387 runs -    23.60 us/run - 469.76 MFLOP/run -  19.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              43239 runs -    23.23 us/run - 469.76 MFLOP/run -  20.22 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              31950 runs -    31.50 us/run - 469.76 MFLOP/run -  14.91 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              40356 runs -    24.85 us/run - 587.20 MFLOP/run -  23.63 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              39330 runs -    25.43 us/run - 587.20 MFLOP/run -  23.09 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              35739 runs -    28.06 us/run - 587.20 MFLOP/run -  20.92 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              37278 runs -    26.91 us/run - 587.20 MFLOP/run -  21.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              27702 runs -    36.21 us/run - 587.20 MFLOP/run -  16.22 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              27927 runs -    35.85 us/run - 939.52 MFLOP/run -  26.21 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              22791 runs -    44.02 us/run - 939.52 MFLOP/run -  21.34 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              25680 runs -    39.01 us/run - 939.52 MFLOP/run -  24.08 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21186 runs -    47.32 us/run - 939.52 MFLOP/run -  19.85 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19581 runs -    51.14 us/run - 939.52 MFLOP/run -  18.37 TFLOPS

@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 3, 2025

Thank you, that shows I'm on the right path.

@0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from 32585e7 to afc464a on August 17, 2025 at 14:01
@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 17, 2025

@jeffbolznv I retested this and found that it now improves tg performance in most of my tests. I did an Nvidia driver update to 580.76.05 in the meantime, so I'm not sure whether that helped. I think this is ready to merge, but I'll wait for #15355 before resolving the subgroup reduce conflict.

Can you give this another try and let me know if you have any concerns?

Nvidia RTX 3090 (without coopmat)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 1979.04 ± 9.97 | 1983.01 ± 14.40 | +0.2% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 139.60 ± 0.42 | 146.23 ± 0.31 | +4.7% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1973.27 ± 7.60 | 1986.97 ± 12.57 | +0.7% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 141.57 ± 0.35 | 145.54 ± 0.31 | +2.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1916.91 ± 7.65 | 1930.39 ± 12.38 | +0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 125.68 ± 2.17 | 130.67 ± 0.16 | +4.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1854.64 ± 63.76 | 1875.40 ± 25.48 | +1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 122.26 ± 0.80 | 126.48 ± 1.52 | +3.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1859.86 ± 9.22 | 1907.00 ± 12.33 | +2.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 117.76 ± 0.47 | 123.85 ± 0.82 | +5.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1785.04 ± 37.53 | 1836.46 ± 29.76 | +2.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 115.01 ± 0.18 | 120.49 ± 0.35 | +4.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1884.02 ± 11.26 | 1880.50 ± 32.65 | -0.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 84.67 ± 0.13 | 85.86 ± 0.46 | +1.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1837.75 ± 25.64 | 1808.84 ± 22.32 | -1.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 83.07 ± 0.12 | 84.15 ± 0.08 | +1.3% |

AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 1498.28 ± 2.62 | 1701.59 ± 2.37 | +13.6% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 99.44 ± 0.02 | 96.28 ± 0.03 | -3.2% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1475.39 ± 0.59 | 1677.09 ± 1.45 | +13.7% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 98.39 ± 0.02 | 95.10 ± 0.02 | -3.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1451.42 ± 1.19 | 1640.18 ± 1.57 | +13.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 88.29 ± 0.03 | 88.86 ± 0.02 | +0.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1395.78 ± 0.47 | 1576.13 ± 0.42 | +12.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 83.35 ± 0.02 | 84.49 ± 0.01 | +1.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1413.33 ± 1.49 | 1592.03 ± 1.74 | +12.6% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 79.65 ± 0.02 | 81.68 ± 0.01 | +2.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1360.71 ± 0.40 | 1530.01 ± 0.55 | +12.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 75.42 ± 0.01 | 77.57 ± 0.03 | +2.9% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1240.33 ± 1.34 | 1382.96 ± 1.47 | +11.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 55.32 ± 0.01 | 55.17 ± 0.01 | -0.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1203.89 ± 0.64 | 1340.01 ± 0.26 | +11.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 52.84 ± 0.01 | 52.75 ± 0.01 | -0.2% |

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 806.60 ± 0.64 | 889.24 ± 4.58 | +10.2% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 75.75 ± 0.55 | 96.14 ± 0.32 | +26.9% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 719.33 ± 0.67 | 767.71 ± 0.88 | +6.7% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 77.48 ± 0.26 | 98.92 ± 0.34 | +27.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 718.33 ± 1.49 | 735.30 ± 0.58 | +2.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 69.15 ± 0.35 | 87.23 ± 1.04 | +26.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 650.45 ± 0.71 | 665.57 ± 0.23 | +2.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 66.37 ± 0.34 | 79.83 ± 0.54 | +20.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 615.49 ± 0.40 | 645.20 ± 0.72 | +4.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 60.94 ± 0.07 | 85.82 ± 0.31 | +40.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 565.78 ± 1.11 | 590.54 ± 0.68 | +4.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 57.84 ± 0.53 | 80.80 ± 0.13 | +39.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 630.28 ± 0.27 | 643.85 ± 0.23 | +2.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 58.55 ± 0.49 | 63.85 ± 0.11 | +9.1% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 576.81 ± 0.27 | 586.58 ± 4.07 | +1.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 55.46 ± 0.20 | 60.79 ± 0.04 | +9.6% |

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 642.91 ± 0.53 | 742.16 ± 0.63 | +15.4% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 40.58 ± 0.10 | 42.26 ± 0.02 | +4.1% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 230.80 ± 0.24 | 242.04 ± 0.26 | +4.9% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 41.18 ± 0.07 | 42.89 ± 0.05 | +4.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 739.25 ± 0.74 | 829.59 ± 0.38 | +12.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 33.77 ± 0.03 | 34.99 ± 0.03 | +3.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 239.08 ± 0.17 | 251.47 ± 0.11 | +5.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 26.13 ± 0.03 | 26.40 ± 0.01 | +1.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 736.43 ± 1.42 | 820.08 ± 3.56 | +11.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 33.52 ± 0.01 | 41.40 ± 0.00 | +23.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 240.77 ± 0.10 | 251.12 ± 0.10 | +4.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 25.97 ± 0.02 | 29.82 ± 0.03 | +14.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 655.59 ± 1.38 | 731.98 ± 0.77 | +11.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 9.93 ± 0.01 | 33.60 ± 0.01 | +238.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 231.13 ± 0.29 | 241.90 ± 0.07 | +4.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 9.14 ± 0.01 | 25.75 ± 0.03 | +181.7% |

@jeffbolznv
Copy link
Collaborator

Sure, I'd like to retest this after it's rebased past #15355, so I can see how it interacts with the different workgroup sizes. But this looks really promising.

@0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from afc464a to 39d620a on August 17, 2025 at 19:13
@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 17, 2025

I fixed a quantization bug and did the bare minimum to make this work side by side with #15355. Combining the optimizations is messier than I thought, especially now that there are three variants of the reduce function. I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory). I guess I might need three variants of my shader, and maybe that is also worth doing for your DMMV_WG_SIZE_SUBGROUP path. I'll take another look tomorrow.

@jeffbolznv
Copy link
Collaborator

I'm still seeing slowdowns, particularly for Q8_0 and usually (but not always) for Q4_0:

5090 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        221.60 ± 2.15 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        273.46 ± 0.55 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        175.91 ± 4.49 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.26 ± 0.22 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       347.20 ± 22.44 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.33 ± 9.39 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        190.95 ± 6.49 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        172.70 ± 7.62 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.39 ± 3.16 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        144.10 ± 6.53 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        150.18 ± 7.15 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |       239.63 ± 12.84 |

5090 after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        216.56 ± 1.40 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        249.90 ± 1.82 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        178.59 ± 7.13 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.79 ± 0.04 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        319.19 ± 6.72 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        200.42 ± 4.91 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        193.13 ± 2.99 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        177.13 ± 6.04 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        174.55 ± 4.17 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        140.05 ± 4.92 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        144.76 ± 6.34 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        228.56 ± 1.37 |

4070 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        102.36 ± 0.18 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        117.95 ± 0.11 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         79.08 ± 1.66 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        183.58 ± 0.47 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         92.25 ± 1.71 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.83 ± 1.64 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         77.66 ± 1.51 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         72.70 ± 1.23 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         55.01 ± 0.04 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         57.48 ± 0.52 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        116.96 ± 0.40 |

4070 after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |         99.57 ± 0.22 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        114.50 ± 0.26 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |         78.76 ± 0.59 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        177.88 ± 0.39 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         90.56 ± 1.49 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.61 ± 0.20 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         77.60 ± 0.82 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         72.65 ± 0.52 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         54.14 ± 0.11 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |         56.69 ± 0.57 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        112.93 ± 0.18 |

> I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory).

Can this just have a runtime check and avoid shared memory when there's only one subgroup?

@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 18, 2025

> I tried using yours, but that is measurably slower than my subgroup-only variant (probably due to no shared memory).

> Can this just have a runtime check and avoid shared memory when there's only one subgroup?

Does the shared memory get optimized out at runtime if it is not used, maybe just guarded by a specialization constant? I always have some doubts, especially about the AMD and Intel shader compilers.

@jeffbolznv
Copy link
Collaborator

I think the shared memory is only guaranteed to be optimized out if it's not statically used, so guarding it with a spec constant wouldn't be sufficient.

@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 18, 2025

@jeffbolznv I unified the subgroup modes and applied the small-m optimization to the integer dot shader too, but it just caused a slowdown, so it's currently disabled in the code:

    if (b_type == GGML_TYPE_Q8_1) {
        return ctx->device->pipeline_dequant_mul_mat_vec_q8_1_f32[DMMV_WG_SIZE_SUBGROUP][a_type][num_cols-1];
    }

You can replace DMMV_WG_SIZE_SUBGROUP with dmmv_wg to apply your optimization. Do you have any idea why they don't work together?
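
Concretely, applying that swap to the snippet above would look roughly like this (sketch only; dmmv_wg is presumably the workgroup-size selection from #15355, everything else is unchanged):

    if (b_type == GGML_TYPE_Q8_1) {
        // let the workgroup-size heuristic pick the pipeline instead of fixing it to DMMV_WG_SIZE_SUBGROUP
        return ctx->device->pipeline_dequant_mul_mat_vec_q8_1_f32[dmmv_wg][a_type][num_cols-1];
    }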

@jeffbolznv
Copy link
Collaborator

I see about a 1% increase across most models from using dmmv_wg on the 5090. I think this is in line with what I saw in the original change, but I only tested a couple of legacy quant models. It seems to help k-quants more.

@jeffbolznv
Copy link
Collaborator

I've noticed that some models (llama and qwen, at least?) will reuse the same vector for multiple mat muls. If you could reuse the quantization result, this should be a win more often. And this could also benefit some prompt processing cases. I think Q8_0 is the least likely to ever show a benefit for tg, since it's the most bandwidth-limited.
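
As a rough illustration of the idea (not the actual #15410 infrastructure), the host side could remember which tensor was last quantized and skip the extra dispatch when it repeats; all names below are hypothetical:

    // Hypothetical sketch: cache the last activation vector that was quantized to Q8_1
    // and reuse the quantized copy for every mat-vec that consumes the same tensor.
    struct quantized_vec_cache {
        const void * last_src = nullptr; // tensor whose Q8_1 copy is in the preallocated buffer
    };

    template <typename QuantizeFn>
    void quantize_once(quantized_vec_cache & cache, const void * src, QuantizeFn quantize) {
        if (cache.last_src != src) {
            quantize();            // dispatch the Q8_1 quantization shader
            cache.last_src = src;  // repeated mat-vecs on the same vector skip the dispatch
        }
    }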

@jeffbolznv
Copy link
Collaborator

> I've noticed that some models (llama and qwen, at least?) will reuse the same vector for multiple mat muls. If you could reuse the quantization result, this should be a win more often. And this could also benefit some prompt processing cases.

I went ahead and added infrastructure for this in #15410. Should be simple to extend it to handle your new path.

@0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from ec3ec03 to 730ba00 on August 21, 2025 at 15:48
@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 21, 2025

The vector reuse was a good idea; here are updated results:

Nvidia RTX 3090 (without coopmat)

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 2026.20 ± 3.66 | 2041.47 ± 6.93 | +0.8% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 140.97 ± 0.54 | 142.77 ± 10.62 | +1.3% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 2035.22 ± 3.19 | 2036.84 ± 10.87 | +0.1% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 140.02 ± 1.47 | 146.64 ± 2.76 | +4.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1965.03 ± 9.28 | 1969.27 ± 9.15 | +0.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 125.02 ± 0.93 | 132.14 ± 0.91 | +5.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1935.37 ± 8.67 | 1954.69 ± 6.31 | +1.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 123.14 ± 1.02 | 131.55 ± 1.38 | +6.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1897.59 ± 8.86 | 1940.74 ± 12.20 | +2.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 120.19 ± 0.57 | 126.01 ± 1.06 | +4.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1877.20 ± 8.70 | 1915.73 ± 12.24 | +2.1% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 116.09 ± 0.83 | 122.17 ± 0.92 | +5.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1919.67 ± 4.49 | 1933.66 ± 8.28 | +0.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 87.51 ± 0.30 | 87.53 ± 0.11 | +0.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1891.76 ± 13.38 | 1897.69 ± 12.15 | +0.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 86.34 ± 0.26 | 85.98 ± 0.20 | -0.4% |

AMD Radeon RX 6800 XT

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 1501.11 ± 2.62 | 1699.58 ± 2.54 | +13.2% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 99.32 ± 0.02 | 96.59 ± 0.80 | -2.7% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1479.86 ± 0.78 | 1674.24 ± 0.82 | +13.1% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 98.23 ± 0.02 | 95.47 ± 0.01 | -2.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1452.99 ± 1.74 | 1642.41 ± 1.11 | +13.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 88.33 ± 0.03 | 89.33 ± 0.01 | +1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1399.49 ± 0.34 | 1575.58 ± 0.60 | +12.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 83.51 ± 0.02 | 84.70 ± 0.03 | +1.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1413.64 ± 1.21 | 1593.95 ± 1.98 | +12.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 79.98 ± 0.01 | 82.04 ± 0.01 | +2.6% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1365.12 ± 0.32 | 1530.70 ± 1.46 | +12.1% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 76.02 ± 0.01 | 77.66 ± 0.02 | +2.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1243.29 ± 0.79 | 1385.51 ± 1.10 | +11.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 55.53 ± 0.01 | 55.40 ± 0.01 | -0.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1206.37 ± 0.50 | 1338.56 ± 0.56 | +11.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 52.97 ± 0.00 | 52.92 ± 0.01 | -0.1% |

AMD Radeon Pro VII

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 820.04 ± 4.81 | 903.12 ± 0.77 | +10.1% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 76.05 ± 0.41 | 99.26 ± 0.12 | +30.5% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 725.11 ± 0.84 | 776.47 ± 0.96 | +7.1% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 77.94 ± 0.34 | 100.55 ± 1.10 | +29.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 723.72 ± 0.64 | 737.25 ± 1.64 | +1.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 69.50 ± 0.45 | 88.40 ± 1.56 | +27.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 655.13 ± 0.70 | 666.96 ± 0.93 | +1.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 66.54 ± 0.31 | 81.72 ± 0.27 | +22.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 618.58 ± 0.30 | 647.37 ± 0.51 | +4.7% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 61.02 ± 0.14 | 89.04 ± 0.30 | +45.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 568.47 ± 0.38 | 592.72 ± 0.66 | +4.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 58.54 ± 0.09 | 82.60 ± 0.46 | +41.1% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 631.94 ± 0.52 | 646.75 ± 0.76 | +2.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 58.38 ± 0.25 | 65.25 ± 0.18 | +11.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 578.33 ± 0.40 | 591.40 ± 2.00 | +2.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 56.35 ± 0.17 | 62.21 ± 0.12 | +10.4% |

Intel A770

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 642.87 ± 0.72 | 744.59 ± 0.69 | +15.8% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 43.70 ± 0.07 | 46.30 ± 0.05 | +5.9% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 231.35 ± 0.18 | 242.53 ± 0.17 | +4.8% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 44.48 ± 0.06 | 46.95 ± 0.07 | +5.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 738.90 ± 6.67 | 825.71 ± 1.69 | +11.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 37.27 ± 0.07 | 37.18 ± 0.11 | -0.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 241.38 ± 0.22 | 250.03 ± 0.08 | +3.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 27.89 ± 0.09 | 27.74 ± 0.06 | -0.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 740.37 ± 2.37 | 820.61 ± 1.95 | +10.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 33.07 ± 0.02 | 42.39 ± 0.09 | +28.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 241.59 ± 0.29 | 251.27 ± 0.21 | +4.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 25.51 ± 0.06 | 30.24 ± 0.13 | +18.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 659.15 ± 1.29 | 733.99 ± 1.75 | +11.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 10.85 ± 0.01 | 33.89 ± 0.06 | +212.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 231.58 ± 0.13 | 241.94 ± 0.27 | +4.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 9.90 ± 0.03 | 25.98 ± 0.01 | +162.4% |

@jeffbolznv
Copy link
Collaborator

5090 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        221.08 ± 1.41 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        271.33 ± 2.30 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        175.83 ± 6.33 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.29 ± 0.32 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        355.60 ± 3.89 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       202.62 ± 10.30 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        189.65 ± 7.77 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        174.81 ± 4.38 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.11 ± 4.52 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        148.07 ± 2.85 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        151.36 ± 7.90 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        244.13 ± 1.13 |

5090 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m C:\models\glm-4-9b-chat-Q4_0.gguf -m C:\models\GLM-4-32B-0414-Q4_0.gguf -m C:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_0.gguf -m C:\models\Meta-Llama-3-8B.Q4_1.gguf -m C:\models\Meta-Llama-3-8B.Q5_0.gguf -m C:\models\Meta-Llama-3-8B.Q5_1.gguf -m C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m C:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m C:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        227.18 ± 2.28 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        259.18 ± 3.05 |
| chatglm 9B Q4_0                |   5.08 GiB |     9.40 B | Vulkan     |  99 |  1 |           tg128 |        179.72 ± 7.38 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         62.74 ± 0.24 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       328.61 ± 24.36 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       206.17 ± 11.95 |
| llama 8B Q4_1                  |   4.77 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        195.52 ± 4.37 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        182.18 ± 4.68 |
| llama 8B Q5_1                  |   5.64 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        173.81 ± 6.15 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        141.36 ± 5.42 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        147.06 ± 6.01 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        235.77 ± 0.62 |

It's a bit surprising that llama 3B Q4_0 is so much slower when the other Q4_0s are faster. It might be worth looking into whether this is a general problem with smaller models. But otherwise this looks like a good improvement and seems fine to enable for NVIDIA (but kept disabled for Q8_0).

@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 21, 2025

Maybe with small models the overhead of the extra quantization shader dispatch stays roughly the same, while the improvement from the mmvq shader shrinks. I don't think that's a big problem, but maybe a minimum vector size before mmvq gets enabled would solve it?
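
A hypothetical version of such a gate, just to make the idea concrete (function name and thresholds are invented and would need per-device tuning):

    #include <cstdint>

    // Only pick the integer-dot (mmvq) path when the mat-vec is large enough that the
    // extra quantize-to-Q8_1 dispatch can amortize; otherwise fall back to the existing
    // dequant shader. The row count could be folded in as well, as suggested below.
    static bool mmvq_worth_it(int64_t k, int64_t rows) {
        const int64_t min_k    = 4096;  // minimum vector length, purely illustrative
        const int64_t min_rows = 1024;  // minimum number of output rows, purely illustrative
        return k >= min_k && rows >= min_rows;
    }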

I'll probably also add an environment variable to disable this path separately from the complete integer dot disable.

It would also be interesting to see how this performs on Nvidia Pascal and Turing, to see if they require different tuning than Ampere+. From the results so far I'd say disable Q8_0 on Nvidia and on AMD RDNA+.

@jeffbolznv
Copy link
Collaborator

> I don't think that's a big problem, but maybe a minimum vector size before mmvq gets enabled would solve it?

Yeah, or maybe also taking the number of rows into account.

> It would also be interesting to see how this performs on Nvidia Pascal and Turing, to see if they require different tuning than Ampere+.

Agreed. My guess is that the DP4 path would be relatively better on those.

@0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec branch from 730ba00 to 0cfc795 on August 24, 2025 at 17:10
@0cc4m
Copy link
Collaborator Author

0cc4m commented Aug 24, 2025

Tuning is really hard, especially for Intel. There are so many options and tweaks with subgroups, different reduction variants and other shader parameters. I've set up some basic tuning for Intel, AMD GCN and Nvidia now.

@jeffbolznv I can still see advantages for Q8_0 on the RTX 3090. I picked a threshold that works for me, but it might be different for you. Can you give it a try? To see the absolute differences, you can set GGML_VK_DISABLE_MMVQ to turn off mmvq entirely, or GGML_VK_FORCE_MMVQ to override the device tuning logic and enable it wherever possible. I created (had AI create for me) a script to accumulate and compare perf_logger output, which might be useful for you too: https://gist.github.com/0cc4m/29b0276e675a36ca8e6cb8ba4f5b231a
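
The two switches are presumably read as plain environment variables; a minimal sketch of how they could be checked on the host side (the actual parsing in ggml-vulkan may differ):

    #include <cstdlib>

    // GGML_VK_DISABLE_MMVQ turns the integer-dot mat-vec path off entirely;
    // GGML_VK_FORCE_MMVQ enables it wherever the device supports it, bypassing the tuning heuristic.
    static bool mmvq_disabled() { return std::getenv("GGML_VK_DISABLE_MMVQ") != nullptr; }
    static bool mmvq_forced()   { return std::getenv("GGML_VK_FORCE_MMVQ")   != nullptr; }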

Here are new benchmarks:

Nvidia RTX 3090
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 3701.74 ± 18.02 | 3713.43 ± 55.10 | +0.3% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 234.27 ± 28.93 | 242.50 ± 30.86 | +3.5% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 3677.74 ± 7.41 | 3747.63 ± 2.02 | +1.9% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 236.89 ± 1.78 | 244.24 ± 2.23 | +3.1% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 3751.44 ± 6.95 | 3642.88 ± 252.23 | -2.9% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 168.62 ± 0.37 | 169.40 ± 2.60 | +0.5% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 3725.04 ± 3.17 | 3720.26 ± 28.96 | -0.1% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 164.57 ± 0.27 | 162.11 ± 0.15 | -1.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1939.17 ± 11.94 | 1917.83 ± 20.08 | -1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 134.02 ± 1.26 | 139.15 ± 0.24 | +3.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1911.68 ± 7.42 | 1831.70 ± 24.06 | -4.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 129.21 ± 0.20 | 134.80 ± 0.36 | +4.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1873.80 ± 11.52 | 1858.17 ± 53.22 | -0.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 124.22 ± 0.49 | 133.92 ± 0.34 | +7.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1831.27 ± 27.99 | 1760.39 ± 33.17 | -3.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 121.09 ± 0.27 | 128.05 ± 0.17 | +5.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1905.03 ± 10.47 | 1863.49 ± 68.54 | -2.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 90.58 ± 0.07 | 90.44 ± 0.03 | -0.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1869.36 ± 19.90 | 1767.49 ± 35.39 | -5.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 88.99 ± 0.09 | 88.52 ± 0.06 | -0.5% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 1327.03 ± 10.50 | 1279.76 ± 30.48 | -3.6% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 91.57 ± 0.16 | 98.78 ± 0.23 | +7.9% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 1282.49 ± 19.75 | 1211.73 ± 21.10 | -5.5% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 88.85 ± 0.13 | 94.37 ± 0.08 | +6.2% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 1309.24 ± 6.81 | 1306.52 ± 23.61 | -0.2% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 61.21 ± 0.07 | 61.48 ± 0.06 | +0.4% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 1254.20 ± 20.83 | 1227.93 ± 14.97 | -2.1% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 60.04 ± 0.02 | 59.99 ± 0.03 | -0.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 695.16 ± 9.76 | 693.80 ± 11.80 | -0.2% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 50.02 ± 0.16 | 56.86 ± 0.07 | +13.7% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 662.82 ± 9.87 | 662.38 ± 3.13 | -0.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 49.14 ± 0.13 | 55.72 ± 0.04 | +13.4% |
AMD Radeon Pro VII
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1720.31 ± 0.73 | 1853.85 ± 2.84 | +7.8% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 139.20 ± 0.31 | 166.69 ± 0.72 | +19.7% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 1491.02 ± 4.87 | 1584.51 ± 0.77 | +6.3% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 127.38 ± 0.94 | 148.41 ± 1.18 | +16.5% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1510.55 ± 0.89 | 1610.08 ± 1.50 | +6.6% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 127.23 ± 0.07 | 134.14 ± 2.72 | +5.4% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 1331.93 ± 0.53 | 1403.96 ± 1.68 | +5.4% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 115.90 ± 0.18 | 118.83 ± 0.16 | +2.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 718.90 ± 0.47 | 736.06 ± 0.21 | +2.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 72.94 ± 0.16 | 90.42 ± 0.26 | +24.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 651.70 ± 0.77 | 664.95 ± 0.69 | +2.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 69.61 ± 0.17 | 84.42 ± 0.15 | +21.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 615.65 ± 0.60 | 641.31 ± 0.27 | +4.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 64.29 ± 0.18 | 93.54 ± 0.49 | +45.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 564.85 ± 0.41 | 586.86 ± 0.38 | +3.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 61.63 ± 0.17 | 87.24 ± 0.14 | +41.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 630.72 ± 0.35 | 638.26 ± 5.14 | +1.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 59.32 ± 0.20 | 68.18 ± 0.08 | +14.9% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 576.39 ± 0.61 | 589.16 ± 0.49 | +2.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 56.42 ± 0.23 | 64.73 ± 0.05 | +14.7% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 466.84 ± 1.03 | 483.12 ± 0.24 | +3.5% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 49.19 ± 0.06 | 63.18 ± 0.26 | +28.4% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 431.72 ± 0.04 | 443.78 ± 0.21 | +2.8% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 46.68 ± 0.18 | 59.42 ± 0.03 | +27.3% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 411.28 ± 0.46 | 423.79 ± 0.33 | +3.0% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 38.31 ± 0.14 | 45.81 ± 0.04 | +19.6% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 381.71 ± 0.13 | 392.65 ± 0.24 | +2.9% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 36.86 ± 0.05 | 43.65 ± 0.01 | +18.4% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 258.27 ± 1.00 | 268.83 ± 0.31 | +4.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 26.28 ± 0.15 | 35.43 ± 0.04 | +34.8% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 243.33 ± 0.17 | 256.70 ± 0.17 | +5.5% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 25.33 ± 0.02 | 34.16 ± 0.01 | +34.9% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 258.27 ± 1.00 | 267.53 ± 0.11 | +3.6% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 26.28 ± 0.15 | 35.34 ± 0.09 | +34.5% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 243.33 ± 0.17 | 258.51 ± 0.13 | +6.2% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 25.33 ± 0.02 | 34.10 ± 0.01 | +34.6% |
AMD Radeon RX 6800 XT
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 3111.30 ± 15.52 | 3536.21 ± 12.36 | +13.7% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 175.49 ± 5.09 | 178.59 ± 1.40 | +1.8% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 2952.98 ± 1.87 | 3335.84 ± 6.22 | +13.0% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 162.83 ± 0.04 | 164.12 ± 0.04 | +0.8% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 2637.61 ± 12.75 | 2927.92 ± 13.02 | +11.0% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 116.79 ± 0.01 | 117.23 ± 0.02 | +0.4% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 2530.36 ± 2.67 | 2792.27 ± 3.10 | +10.4% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 110.41 ± 0.02 | 110.64 ± 0.01 | +0.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1464.40 ± 0.94 | 1662.32 ± 1.54 | +13.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 90.90 ± 0.01 | 92.15 ± 0.01 | +1.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1396.95 ± 0.78 | 1578.59 ± 0.33 | +13.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 86.23 ± 0.02 | 87.16 ± 0.01 | +1.1% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1425.91 ± 1.87 | 1614.64 ± 0.45 | +13.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 82.50 ± 0.02 | 84.79 ± 0.01 | +2.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1361.51 ± 0.57 | 1535.38 ± 0.96 | +12.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 78.36 ± 0.02 | 80.12 ± 0.02 | +2.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1249.83 ± 1.00 | 1393.34 ± 0.94 | +11.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 56.38 ± 0.01 | 56.78 ± 0.00 | +0.7% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1204.72 ± 0.09 | 1336.60 ± 0.38 | +10.9% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 53.82 ± 0.00 | 53.94 ± 0.01 | +0.2% |
Intel A770
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: | ---: | ---: |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1630.55 ± 7.94 | 1883.27 ± 4.11 | +15.5% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 68.08 ± 0.10 | 72.92 ± 0.07 | +7.1% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 411.87 ± 0.44 | 425.37 ± 0.14 | +3.3% |
| llama 3B Q4_0 | 1.78 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 44.55 ± 0.03 | 46.24 ± 0.09 | +3.8% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | pp512 | 1411.42 ± 1.64 | 1634.99 ± 2.51 | +15.8% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 0 | tg128 | 23.09 ± 0.03 | 63.24 ± 0.09 | +173.9% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | pp512 | 396.78 ± 0.13 | 412.44 ± 0.32 | +3.9% |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | tg128 | 19.38 ± 0.01 | 42.18 ± 0.06 | +117.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 742.45 ± 0.82 | 828.01 ± 1.81 | +11.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 37.86 ± 0.04 | 37.97 ± 0.01 | +0.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 239.62 ± 0.07 | 252.04 ± 0.08 | +5.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 28.30 ± 0.04 | 28.40 ± 0.01 | +0.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 739.79 ± 1.00 | 824.99 ± 2.26 | +11.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 34.21 ± 0.02 | 43.09 ± 0.02 | +26.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 241.57 ± 0.15 | 251.77 ± 0.20 | +4.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 26.19 ± 0.06 | 30.90 ± 0.02 | +18.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 660.65 ± 0.89 | 735.19 ± 1.16 | +11.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 10.98 ± 0.00 | 34.71 ± 0.01 | +216.1% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 231.92 ± 0.15 | 242.37 ± 0.13 | +4.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 10.02 ± 0.00 | 26.47 ± 0.00 | +164.2% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 500.67 ± 1.79 | 564.41 ± 0.37 | +12.7% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 27.03 ± 0.04 | 25.51 ± 0.02 | -5.6% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 183.46 ± 0.06 | 192.74 ± 0.09 | +5.1% |
| llama 13B Q4_0 | 6.60 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 20.67 ± 0.01 | 19.85 ± 0.01 | -4.0% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | pp512 | 447.84 ± 0.84 | 505.69 ± 0.74 | +12.9% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 0 | tg128 | 7.28 ± 0.01 | 24.05 ± 0.02 | +230.4% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | pp512 | 175.62 ± 0.07 | 184.32 ± 0.11 | +5.0% |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | Vulkan | 99 | 1 | tg128 | 6.74 ± 0.00 | 18.81 ± 0.01 | +179.1% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 276.04 ± 0.74 | 306.49 ± 0.47 | +11.0% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 16.51 ± 0.00 | 15.78 ± 0.00 | -4.4% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 142.42 ± 0.06 | 152.80 ± 0.07 | +7.3% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 13.87 ± 0.01 | 13.37 ± 0.01 | -3.6% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 276.04 ± 0.74 | 305.53 ± 1.01 | +10.7% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 16.51 ± 0.00 | 15.74 ± 0.00 | -4.7% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 142.42 ± 0.06 | 152.73 ± 0.05 | +7.2% |
| llama 13B Q4_0 | 12.56 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 13.87 ± 0.01 | 13.37 ± 0.01 | -3.6% |

I might have unintentionally tuned Intel for small models here; I'll take a look at large models soon.

if (quantize_y) {
    if (ctx->prealloc_y_last_pipeline_used != to_q8_1.get() ||
        ctx->prealloc_y_last_tensor_used != src1) {
        ggml_vk_quantize_q8_1(ctx, subctx, { d_Qy, qy_buf_offset, VK_WHOLE_SIZE }, { d_Y, 0, VK_WHOLE_SIZE }, y_ne * ne12 * ne13, true);
Collaborator

The "prealloc_y_need_sync" logic is needed for quantize_y, too. I think it would make sense to pull in the change from #15544; it may help performance here.
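
For illustration, a minimal sketch of what honoring that flag in the quantize_y path could look like, based on the snippet quoted above. The ggml_vk_sync_buffers call and its placement are assumptions on my part, not the actual change from #15544:

if (quantize_y) {
    if (ctx->prealloc_y_last_pipeline_used != to_q8_1.get() ||
        ctx->prealloc_y_last_tensor_used != src1) {
        // assumed check: wait for the previous consumer of prealloc_y before overwriting it
        if (ctx->prealloc_y_need_sync) {
            ggml_vk_sync_buffers(subctx); // assumed helper; exact name/signature may differ
        }
        ggml_vk_quantize_q8_1(ctx, subctx, { d_Qy, qy_buf_offset, VK_WHOLE_SIZE }, { d_Y, 0, VK_WHOLE_SIZE }, y_ne * ne12 * ne13, true);
        // remember what was quantized so the next mul_mat can skip the requantize
        ctx->prealloc_y_last_pipeline_used = to_q8_1.get();
        ctx->prealloc_y_last_tensor_used = src1;
    }
}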

@jeffbolznv
Collaborator

Here are some q8_0 numbers on 5090 from the latest change with my suggestion applied (which didn't obviously help):

disable:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\airoboros-m-7b-3.1.2.Q8_0.gguf -m c:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m c:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m c:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        291.37 ± 2.01 |
| llama 7B Q8_0                  |   7.17 GiB |     7.24 B | Vulkan     |  99 |  1 |           tg128 |        163.16 ± 3.79 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        155.42 ± 5.14 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        162.54 ± 4.57 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |       257.74 ± 12.68 |

default:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\airoboros-m-7b-3.1.2.Q8_0.gguf -m c:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m c:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m c:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        287.45 ± 1.32 |
| llama 7B Q8_0                  |   7.17 GiB |     7.24 B | Vulkan     |  99 |  1 |           tg128 |        159.99 ± 1.93 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        154.47 ± 2.67 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        160.30 ± 2.45 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        259.48 ± 1.21 |

force:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\airoboros-m-7b-3.1.2.Q8_0.gguf -m c:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf -m c:\models\mistral-7b-instruct-v0.3-q8_0.gguf -m c:\models\qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        280.75 ± 1.09 |
| llama 7B Q8_0                  |   7.17 GiB |     7.24 B | Vulkan     |  99 |  1 |           tg128 |        158.34 ± 1.35 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        151.25 ± 4.87 |
| llama 7B Q8_0                  |   7.17 GiB |     7.25 B | Vulkan     |  99 |  1 |           tg128 |        156.77 ± 5.93 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |  1 |           tg128 |        249.43 ± 2.61 |
