
Commit 573b2ad

update readme
1 parent aba8ebe commit 573b2ad

1 file changed: Readme.md (+13, -16 lines)
@@ -17,7 +17,7 @@ This repository contains the official implementation of Half-Quadratic Quantization
 <li> HQQ is compatible with peft training.</li>
 <li> We try to make HQQ fully compatible with `torch.compile` for faster inference and training.</li>
 </ul>
-
+
 <b>What is the quality of the quantized models?</b><br>
 We have detailed benchmarks on both language and vision models. Please refer to our blog posts: <a href="https://mobiusml.github.io/hqq_blog/">HQQ</a>, <a href="https://mobiusml.github.io/1bit_blog/">HQQ+</a>.<br>

@@ -26,10 +26,10 @@ This repository contains the official implementation of Half-Quadratic Quantization

 <b>What quantization settings should I use?</b><br>
 You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage, and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend. If you want to use lower bits like `nbits=2`, you should use `axis=0` with a low group-size via HQQ+, meaning adding low-rank adapters and fine-tuning with a small dataset.<br>
-
+
 <b>What does the `axis` parameter mean?</b><br>
 The `axis` parameter is the axis along which grouping is performed. In general, `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.<br>
-
+
 <b>What is the difference between HQQ and HQQ+?</b><br>
 HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.<br>

@@ -65,9 +65,6 @@ The quantization parameters are set as follows:

 - ```nbits``` (int): supports 8, 4, 3, 2, 1 bits.
 - ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size```.
-- ```quant_zero``` (bool): if True, it quantizes the zero-point to 8-bit without grouping.
-- ```quant_scale``` (bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.
-- ```offload_meta``` (bool): if True, meta-data is offloaded to the CPU.
 - ```view_as_float``` (bool): if True, the quantized parameter is viewed as a float instead of an int type.

 Setting ```offload_meta=True``` drastically decreases the GPU memory requirements but makes processing slower for smaller group-sizes. When turned on, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB of VRAM, respectively.
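For reference (not part of this diff), a minimal sketch of how the parameters listed above map onto a quantization config, using only the settings the updated README keeps; the `axis=1` value follows the recommended defaults discussed earlier in the FAQ:

```Python
# Minimal sketch, not part of this commit: build a config with the
# parameters documented above. axis=1 is the recommended default;
# axis=0 gives better quality at low bits but needs the ATEN backend.
from hqq.core.quantize import BaseQuantizeConfig

quant_config = BaseQuantizeConfig(
    nbits=4,        # supports 8, 4, 3, 2, 1 bits
    group_size=64,  # weight.numel() must be divisible by group_size
    axis=1,
)
```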
@@ -76,9 +73,9 @@ Setting ```offload_meta=True``` drastically decreases the GPU memory requirements
 #### Native Backends
 The following native backends can be used by the `HQQLinear` module:
 ```Python
-HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend
+HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend - Default
 HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) #Compiled Pytorch
-HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend
+HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend - only axis=0 supported
 ```
 The ```HQQBackend.ATEN``` backend is automatically installed and used by default when available.
 Note that ```HQQBackend.ATEN``` only supports `axis=0`. For `axis=1` you need to use ```HQQBackend.PYTORCH``` or ```HQQBackend.PYTORCH_COMPILE```.
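As a side note (not part of the diff), the snippet above assumes `HQQLinear` and `HQQBackend` are already in scope; a minimal, self-contained sketch that also picks the backend based on the `axis` setting, mirroring the note above:

```Python
# Minimal sketch, not part of this commit. Import path follows the
# `from hqq.core.quantize import *` used elsewhere in this README.
from hqq.core.quantize import HQQLinear, HQQBackend

axis = 1  # assumption for illustration; see the axis discussion above

# ATEN is the fastest native backend but only supports axis=0;
# fall back to the compiled PyTorch backend for axis=1.
HQQLinear.set_backend(HQQBackend.ATEN if axis == 0 else HQQBackend.PYTORCH_COMPILE)
```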
@@ -88,7 +85,7 @@ Below you can find the speed-up benchmark with various backends, ```HQQBackend.P
 <div class="row"><center>
 <div class="column">
 <img src="https://github.com/mobiusml/hqq/blob/master/imgs/hqq_cuda_dequant_llama27b_titanrtx.png" alt="Titan RTX" style="width:48%">
-<img src="https://github.com/mobiusml/hqq/blob/master/imgs/hqq_cuda_dequant_llama270b_a100.png" alt="A100" style="width:48%">
+<img src="https://github.com/mobiusml/hqq/blob/master/imgs/hqq_cuda_dequant_llama270b_a100.png" alt="A100" style="width:48%">
 </div>
 </center>
 </div>
@@ -124,7 +121,7 @@ For usage with HF's transformers, see the example below from the <a href="https:
 from transformers import AutoModelForCausalLM, HqqConfig

 # All linear layers will use the same quantization config
-quant_config = HqqConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False, axis=1)
+quant_config = HqqConfig(nbits=4, group_size=64)

 # Load and quantize
 model = AutoModelForCausalLM.from_pretrained(
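The hunk cuts off inside the `from_pretrained(` call. For context only, a plausible completion under the standard transformers API; the model id, dtype, and device map below are placeholders, not taken from this commit:

```Python
# Sketch of a complete call, assuming the standard transformers kwarg
# for passing an HqqConfig; model_id, dtype, and device_map are placeholders.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config,  # quantizes the linear layers on load
)
```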
@@ -145,7 +142,7 @@ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype

 #Quantize
 from hqq.models.hf.base import AutoHQQHFModel
-quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
+quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
 AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
 ```
 #### Save/Load
@@ -160,7 +157,7 @@ AutoHQQHFModel.save_quantized(model, save_dir)
 model = AutoHQQHFModel.from_quantized(save_dir)
 ```
 #### Setting a backend
-You can set a native backned as follows:
+You can set a native backend as follows:
 ```Python
 HQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)
 ```
@@ -185,8 +182,8 @@ You can set up various quantization configurations for different layers by specifying
 #### Transformers 🤗
 ```Python
 # Each linear layer with the same tag will use a dedicated quantization config
-q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
-q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}
+q4_config = {'nbits':4, 'group_size':64}
+q3_config = {'nbits':3, 'group_size':32}

 quant_config = HqqConfig(dynamic_config={
 'self_attn.q_proj':q4_config,
@@ -202,8 +199,8 @@ quant_config = HqqConfig(dynamic_config={
 #### HQQ lib
 ```Python
 from hqq.core.quantize import *
-q4_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False)
-q3_config = BaseQuantizeConfig(nbits=3, group_size=32, quant_zero=False, quant_scale=False)
+q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
+q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

 quant_config = {'self_attn.q_proj':q4_config,
 'self_attn.k_proj':q4_config,
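The hunk ends inside the dictionary. For illustration only, a sketch of what the full per-layer mapping could look like; the layer names beyond `q_proj`/`k_proj` follow common Llama-style module naming and are assumptions, not content of this commit:

```Python
# Sketch only; v_proj/o_proj and the MLP projection names are assumed
# Llama-style module names, not taken from this diff.
from hqq.core.quantize import BaseQuantizeConfig

q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

quant_config = {
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,  # assumed layer name
    'self_attn.o_proj': q4_config,  # assumed layer name
    'mlp.gate_proj': q3_config,     # assumed layer name
    'mlp.up_proj': q3_config,       # assumed layer name
    'mlp.down_proj': q3_config,     # assumed layer name
}
```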
