README.md (+19 −7)
@@ -72,6 +72,25 @@ _Average performance on the RULER dataset with 4k context length and Loogle Shor
Please refer to the [evaluation](evaluation/README.md) directory for more details and results.

## KV cache quantization

We support KV cache quantization through the transformers `QuantizedCache` class (see the [HF blog post](https://huggingface.co/blog/kv-cache-quantization#how-to-use-quantized-kv-cache-in-%F0%9F%A4%97-transformers)). To use it, simply pass a cache object to your pipeline:

```python
from transformers import QuantizedCacheConfig, QuantoQuantizedCache

config = QuantizedCacheConfig(nbits=4)
cache = QuantoQuantizedCache(config)

pipe(..., cache=cache)
```

By default, the `DynamicCache` is used (no quantization).
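
For reference, the unquantized default can also be passed explicitly. This is a minimal sketch, assuming the pipeline accepts the same `cache` argument as in the snippet above:

```python
from transformers import DynamicCache

# Explicitly pass the default (unquantized) cache, matching the default behavior described above.
cache = DynamicCache()

pipe(..., cache=cache)
```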

> [!IMPORTANT]
> To use the `QuantizedCache`, you need to install additional dependencies (e.g. `pip install optimum-quanto==0.2.4`; see also [this issue](https://github.com/huggingface/transformers/issues/34848)).

## FAQ

<details><summary>
@@ -165,10 +184,3 @@ Check the [demo notebook](notebooks/per_layer_compression_demo.ipynb) for more d
</details>
<details><summary>

### Is quantization supported?
</summary>

We don't support quantization of the KV cache yet. Quantization can achieve up to 4x compression moving from (b)float16 to int4, and we believe it is orthogonal to the KV cache pruning strategies proposed in this repository.
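
As a rough illustration of where that factor comes from, here is a back-of-the-envelope sketch; the model shape below is hypothetical and is not taken from this repository:

```python
# Back-of-the-envelope arithmetic behind the "up to 4x" figure above.
num_layers, num_kv_heads, head_dim, seq_len = 32, 8, 128, 4096
elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # keys + values

fp16_bytes = elements * 2      # 16 bits per element
int4_bytes = elements * 0.5    # 4 bits per element, ignoring quantization metadata

print(fp16_bytes / int4_bytes)  # 4.0
```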