
Releases: mobiusml/hqq

v.0.2.3.post1

20 Feb 11:12

Bug fixes:

  • Check for W_q in the state dict to fix PEFT issue #151
  • Fix bugs related to AutoHQQHFModel.save_to_safetensors

v0.2.3

17 Feb 08:43
  • vLLM support via patching: GemLite backend + on-the-fly quantization
  • Add support for Aria
  • Add support for loading quantized SequenceClassification models
  • Faster decoding (custom CUDA graphs, SDPA math backend, etc.)
  • Fix bugs related to torch.compile and hf_generator with newer transformers versions
  • Fix bugs related to saving quantized models with no grouping
  • Fix bugs related to saving large quantized models
  • Update examples
  • Add support for HQQLinear.to(device) (see the sketch below)
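
A minimal sketch of on-the-fly quantization and the new device move, assuming the HQQLinear/BaseQuantizeConfig API from the hqq README (exact keyword names may differ across versions):

    import torch
    import torch.nn as nn
    from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

    # Quantize a regular linear layer on-the-fly (4-bit, group size 64).
    layer = nn.Linear(4096, 4096, bias=False)
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
    hqq_layer = HQQLinear(layer, quant_config=quant_config,
                          compute_dtype=torch.float16, device="cuda")

    # New in this release: move the quantized layer like any nn.Module.
    hqq_layer = hqq_layer.to("cpu")
    hqq_layer = hqq_layer.to("cuda")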

v0.2.2

12 Sep 15:23

HQQ v0.2.2

  • Support static-cache compilation without using HFGenerator (see the sketch below)
  • Fix various issues related to torch.compile
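
A minimal sketch of static-cache decoding through plain transformers instead of HFGenerator; the model id is illustrative, and the model would typically be HQQ-quantized first:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
                                                 device_map="cuda")

    # Compile the forward pass; the static cache keeps tensor shapes fixed across steps.
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
    print(tokenizer.decode(out[0], skip_special_tokens=True))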

v.0.2.1

29 Aug 16:25

HQQ v0.2.1

v.0.2.0

28 Aug 10:05

HQQ v0.2.0

  • Bug fixes
  • Safetensors support for transformers via huggingface/transformers#33141 (see the sketch below)
  • quant_scale, quant_zero, and offload_meta are now deprecated. They can still be used with the hqq library, but not with the transformers integration.
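
A minimal sketch of quantizing through the transformers integration and saving to safetensors; the model id is illustrative, and the HqqConfig arguments follow the transformers documentation:

    import torch
    from transformers import AutoModelForCausalLM, HqqConfig

    # 4-bit HQQ quantization on load; no quant_scale/quant_zero/offload_meta here.
    quant_config = HqqConfig(nbits=4, group_size=64)

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",   # illustrative model id
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )

    # save_pretrained writes safetensors shards by default.
    model.save_pretrained("llama2-7b-hqq-4bit")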

v.0.1.8

11 Jul 12:00

HQQ v0.1.8

  • Add BitBlas backend support
  • Simpler HQQLinear creation from raw weights via HQQLinear.from_weights(W, bias, etc.) (see the sketch below)
  • Fix memory leak while swapping layers for the TorchAO backend
  • Add HQQLinear.unpack() call
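
A minimal sketch of building an HQQLinear directly from a weight tensor; the keyword names for from_weights are assumptions based on the bullet above, so check the hqq source for the exact signature:

    import torch
    from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

    W = torch.randn(4096, 4096, dtype=torch.float16)  # illustrative weight matrix
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

    # Assumed keyword names; the bullet only documents from_weights(W, bias, etc.).
    hqq_layer = HQQLinear.from_weights(W, bias=None, quant_config=quant_config,
                                       compute_dtype=torch.float16, device="cuda")

    # Also new in this release: unpack the bit-packed quantized weights.
    W_q_unpacked = hqq_layer.unpack()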

v0.1.7.post3

28 May 07:48

HQQ v0.1.7.post3

  • Enable CPU quantization and runtime (see the sketch below)
  • Fix _load_state_dict
  • Fix extra_repr in HQQLinear
  • Fix from_quantized bugs
  • Fix | typing
  • Fix 3-bit axis=1 slicing bug
  • Add 5/6-bit quantization for testing
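
A minimal sketch of the CPU path, assuming the same HQQLinear API as on GPU with device="cpu":

    import torch
    import torch.nn as nn
    from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

    layer = nn.Linear(1024, 1024)
    quant_config = BaseQuantizeConfig(nbits=8, group_size=64)

    # Quantize and run entirely on CPU; float32 compute is the safe default there.
    hqq_layer = HQQLinear(layer, quant_config=quant_config,
                          compute_dtype=torch.float32, device="cpu")
    y = hqq_layer(torch.randn(1, 1024))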

v0.1.7.post2

06 May 16:41

HQQ v0.1.7.post2

  • Various bug fixes, especially with AutoHQQHFModel and the patching logic, to make it work with any transformers model.
  • Readme refactoring.
  • Whisper example.

v0.1.7

24 Apr 08:59

HQQ v0.1.7

  • Faster inference with torchao/Marlin 4-bit kernels (see the sketch below)
  • Multi-GPU support for model.quantize()
  • Custom HF generator
  • Various bug fixes/improvements
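
A minimal sketch of quantizing with the hqq engine, switching to a faster 4-bit backend, and decoding with the custom HF generator; the model id is illustrative, and prepare_for_inference/HFGenerator follow the hqq README (they may postdate this exact release):

    import torch
    from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
    from hqq.core.quantize import BaseQuantizeConfig
    from hqq.utils.patching import prepare_for_inference
    from hqq.utils.generation_hf import HFGenerator

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # axis=1 quantization is required by the fast backends; bfloat16 for torchao_int4.
    model.quantize_model(quant_config=BaseQuantizeConfig(nbits=4, group_size=64, axis=1),
                         compute_dtype=torch.bfloat16, device="cuda")

    # Swap in the torchao 4-bit kernels, then decode with the custom generator.
    prepare_for_inference(model, backend="torchao_int4")
    gen = HFGenerator(model, tokenizer, max_new_tokens=256, do_sample=True, compile="partial")
    gen.generate("Explain quantization in one paragraph.", print_tokens=True)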

v0.1.6.post2

19 Mar 18:24

HQQ v0.1.6.post2

Same as v0.1.6 with setup.py fixes:

  • find_packages fix: #25
  • Auto-build CUDA kernels via the PyPI package: #26