This repository contains the official implementation of Half-Quadratic Quantization (HQQ).
<ul>
<li> HQQ is compatible with peft training.</li>
<li> We aim to make HQQ fully compatible with `torch.compile` for faster inference and training.</li>
</ul>
<b>What is the quality of the quantized models? </b><br>
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: <a href="https://mobiusml.github.io/hqq_blog/">HQQ</a>, <a href="https://mobiusml.github.io/1bit_blog/">HQQ+</a>.<br>
<b>What quantization settings should I use?</b><br>
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage, and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend. If you want to use lower bits such as `nbits=2`, you should use `axis=0` with a low group-size via HQQ+, meaning you add low-rank adapters and fine-tune with a small dataset. <br>
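A minimal sketch of these recommended defaults, assuming `BaseQuantizeConfig`/`HQQLinear` helpers from this repo (treat the exact keyword names and signatures as assumptions rather than a verified API):

```python
# Sketch: quantize one linear layer with the recommended starting settings.
# The BaseQuantizeConfig/HQQLinear signatures here are assumptions, not verified.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096, bias=False)

# Recommended starting point: 4-bit weights, groups of 64, grouping along axis=1
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Swap the fp16 layer for its HQQ-quantized counterpart
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float16, device="cuda")
```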
<b>What does the `axis` parameter mean? </b><br>
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.<br>
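As a toy illustration of what grouping along each axis means (this mirrors the idea, not the library's internal code):

```python
import torch

W = torch.randn(128, 128)  # toy weight matrix
group_size = 64

# axis=1: reshape to (num_groups, group_size); quantization stats per row
groups_axis1 = W.reshape(-1, group_size)
scale_axis1 = groups_axis1.abs().amax(dim=1, keepdim=True)  # one scale per group

# axis=0: reshape to (group_size, num_groups); quantization stats per column
groups_axis0 = W.reshape(group_size, -1)
scale_axis0 = groups_axis0.abs().amax(dim=0, keepdim=True)
```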
<b>What is the difference between HQQ and HQQ+?</b><br>
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.<br>
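A hypothetical sketch of that idea, with a frozen quantized layer plus a trainable low-rank path (the class and names here are illustrative, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class LowRankAdapterLinear(nn.Module):
    """Illustrative HQQ+-style layer: frozen quantized base + low-rank adapter."""
    def __init__(self, base_layer, in_features, out_features, rank=8):
        super().__init__()
        self.base = base_layer  # e.g., an HQQ-quantized linear, kept frozen
        # Standard low-rank init: A small random, B zeros, so the adapter starts as a no-op
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Base output plus the trainable low-rank correction
        return self.base(x) + (x @ self.A.T) @ self.B.T
```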
The quantization parameters are set as follows:
```nbits``` (int): supports 8, 4, 3, 2, 1 bits.

```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size```.

```quant_zero``` (bool): if True, it quantizes the zero-point to 8-bit without grouping.

```quant_scale``` (bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.

```offload_meta``` (bool): if True, meta-data is offloaded to the CPU.

```view_as_float``` (bool): if True, the quantized parameter is viewed as float instead of an int type.
Setting ```offload_meta=True``` drastically decreases the GPU memory requirements but makes processing slower for smaller group-sizes. When turned on, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB VRAM respectively.
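Putting these parameters together, a hedged example of a low-memory 2-bit configuration (the keyword names are assumed from the parameter list above):

```python
from hqq.core.quantize import BaseQuantizeConfig

# Hypothetical low-VRAM 2-bit configuration; keyword names are assumptions
# based on the parameter list above.
quant_config = BaseQuantizeConfig(
    nbits=2,            # one of 8, 4, 3, 2, 1
    group_size=16,      # weight.numel() must be divisible by this
    quant_zero=True,    # quantize the zero-point to 8-bit, no grouping
    quant_scale=True,   # quantize the scale to 8-bit with group_size=128
    offload_meta=True,  # keep meta-data on the CPU to cut VRAM usage
)
```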