What about adding low-bit floating-point data types? The default would be to dequantize to fp16.
On top of that, I'm writing an efficient SIMD backend that operates directly in quantized space for the data types listed below; my approach supports widths of up to 8 bits.
In particular, nf4, nf4dq (double quantization), and fp4 would enable lossless conversion of existing 4-bit bitsandbytes-encoded models from HF.
I just want to hear your opinions on, and interest in, this topic. I'll come back with a PR once my backend is ready.
Also, if I missed an important low-bit data type, please point it out.
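To make the nf4 case concrete, here is a minimal dequantization sketch. The 16-entry codebook is the NormalFloat4 table from the QLoRA paper as used by bitsandbytes; the function names, the flat list-of-codes input layout, and the block size are my own illustrative assumptions (real bnb tensors pack two codes per byte, and nf4dq additionally quantizes the per-block absmax values):

```python
# NormalFloat4 codebook (QLoRA / bitsandbytes). Each 4-bit code 0..15
# indexes into this table; the result is rescaled by a per-block absmax.
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4(codes, absmax, block_size=64):
    """Dequantize unpacked nf4 codes (one int 0..15 per element).

    `absmax` holds one scale per block of `block_size` elements.
    Sketch only: no packing, no double quantization of absmax.
    """
    out = []
    for i, code in enumerate(codes):
        scale = absmax[i // block_size]
        out.append(NF4_CODEBOOK[code] * scale)
    return out
```

A SIMD backend would keep the codes packed and perform the codebook lookup with a shuffle instruction, but the arithmetic is the same.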
nf4
nf4dq
fp8e5m2
fp8e4m3
fp8e3m4
fp6e4m1
fp6e3m2
fp6e2m3
fp5e3m1
fp5e2m2
fp5e1m3
fp4e3m0
fp4e2m1
fp4e1m2
fp3e2m0
fp3e1m1
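For reference, the fpXeYmZ names above follow the usual minifloat convention: 1 sign bit, Y exponent bits, Z mantissa bits, X = 1 + Y + Z total. A hedged sketch of a generic decoder, assuming an IEEE-style bias of 2^(Y-1) - 1 with subnormals and ignoring inf/nan encodings (which some of these formats, e.g. fp8e4m3 variants, handle specially):

```python
def decode_minifloat(bits: int, e: int, m: int) -> float:
    """Decode one fpXeYmZ value: 1 sign bit, e exponent bits, m mantissa bits.

    Assumes IEEE-style bias (2**(e-1) - 1) and subnormals; no inf/nan
    handling, so this is a sketch of the common layout, not any one spec.
    """
    sign = -1.0 if (bits >> (e + m)) & 1 else 1.0
    exp = (bits >> m) & ((1 << e) - 1)
    man = bits & ((1 << m) - 1)
    bias = (1 << (e - 1)) - 1
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * man * 2.0 ** (1 - bias - m)
    # Normal: implicit leading 1 plus the fractional mantissa.
    return sign * (1.0 + man * 2.0 ** -m) * 2.0 ** (exp - bias)
```

For fp4e2m1 this yields the eight magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, which is why per-tensor or per-block scaling is essential at these widths.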