Releases: intel/auto-round
v0.6.0
Highlights
- provide experimental support for the gguf q*_k formats and customized mixed-bit settings (see the sketch after this list)
- support xpu in triton backend by @wenhuach21 in #563
- add torch backend by @WeiweiZhang1 in #555
- provide initial support for the llmcompressor format (only INT8 W8A8 dynamic quantization is supported) by @xin3he in #646
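A minimal sketch of how the experimental GGUF q*_k export and the customized mixed-bit setting could be combined through the Python API. The model name, the layer names in `layer_config`, and the exact `gguf:q4_k_m` format string are illustrative assumptions, not a verified recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Customized mixed bits: override the default 4-bit setting for a few layers.
# The layer names below are hypothetical examples.
layer_config = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.self_attn.k_proj": {"bits": 8},
}

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
autoround.quantize()
# Export to one of the experimental GGUF q*_k formats.
autoround.save_quantized("./Qwen2.5-0.5B-q4_k_m", format="gguf:q4_k_m")
```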
What's Changed
- bump version into v0.5.1 by @XuehaoSun in #540
- Freeze pytorch & ipex version in CI by @XuehaoSun in #541
- fix quantization config for inference by @WeiweiZhang1 in #542
- [critical bug] remove redundant round in dq simulation by @wenhuach21 in #543
- update readme by @wenhuach21 in #550
- add recipes for qwen3 8b and 14b by @n1ck-guo in #552
- itrex requires torch<2.7 by @XuehaoSun in #548
- [GGUF STEP4] fix search bug and improve packing & eval speed by @n1ck-guo in #545
- refine xpu requirement/config json and fix several issues by @wenhuach21 in #558
- add UE5M3 simulation by @wenhuach21 in #562
- support xpu in triton backend by @wenhuach21 in #563
- fix typo in backend by @wenhuach21 in #564
- update habana docker to 1.21.0 by @XuehaoSun in #566
- Support for more gguf formats and float zp for Q*_1 by @n1ck-guo in #560
- update readme by @wenhuach21 in #569
- update readme by @wenhuach21 in #571
- support for llava-based hf model by @n1ck-guo in #568
- add gguf accuracy data by @wenhuach21 in #574
- add sym & asym gguf quant for gguf baseline (iter==0) by @n1ck-guo in #573
- modify default asym 4bits auto-round format to awq, fix save folder typo for mllm by @WeiweiZhang1 in #575
- improve the robustness of parsing vlm config by @wenhuach21 in #577
- switch to transformers API in cpu ut by @wenhuach21 in #580
- add torch backend by @WeiweiZhang1 in #555
- fix awq exporting at group_size=-1 by @wenhuach21 in #579
- refactor cuda ut to facilitate automation by @n1ck-guo in #559
- fix tensor shape mismatch error for API usage by @WeiweiZhang1 in #582
- fix device bug at calibration by @wenhuach21 in #587
- Update gguf_accuracy (q3_ks) by @SinpackKonmakan in #590
- add recipes for deepseek-r1-0528 by @n1ck-guo in #588
- correct errors of deepseek-r1-0528 recipes by @n1ck-guo in #591
- fix cuda ut by @wenhuach21 in #592
- Bump protobuf from 3.20.1 to 3.20.2 in /test/test_cuda by @dependabot[bot] in #585
- rm unnecessary forward to improve speed by @wenhuach21 in #593
- update readme by @wenhuach21 in #597
- fix q2k bug by @n1ck-guo in #599
- support for q4_k_m by @n1ck-guo in #596
- fix vlm unit test path error by @WeiweiZhang1 in #601
- fix lots of critical gguf bugs and support imatrix in rtn mode by @wenhuach21 in #595
- fix gguf bug by @wenhuach21 in #610
- mv some checkers by @wenhuach21 in #611
- fix gguf packing bug and moe regression by @wenhuach21 in #614
- support customized mixed bits for gguf by @wenhuach21 in #615
- fix double quant sym bug by @wenhuach21 in #616
- FP8 WOQ export by @wenhuach21 in #617
- fix bug of q5_k_s w/ imatrix by @n1ck-guo in #620
- add auto-round related vllm and transformers UT by @WeiweiZhang1 in #613
- refine docs (0624) by @WeiweiZhang1 in #619
- fix not using imatrix for gguf at rtn mode by @wenhuach21 in #623
- fix vlm hf config loading issue by @WeiweiZhang1 in #624
- refine gguf rtn algorithm and fix bugs by @wenhuach21 in #630
- fix gguf bug of moe models and lmhead/embedding bits setting regression by @n1ck-guo in #628
- [BUG FIX] fix bug of deepseek gguf:q*k by @n1ck-guo in #637
- support packing immediately for gguf to reduce ram usage by @wenhuach21 in #638
- support llmcompressor format by @xin3he in #646
- fix norm_bias_tuning by @wenhuach21 in #639
- [W4A8]Fix Packing by @yiliu30 in #648
- Integrate RTN quantization into GGUF packing to enhance robustness by @n1ck-guo in #644
- Remove vlm cuda UT dependencies version restrictions by @XuehaoSun in #651
- speedup mxfp tuning and fix nvfp bug by @wenhuach21 in #647
- support two more calib datasets and fix embedding layer bug by @wenhuach21 in #653
- fix some issues by @wenhuach21 in #655
- fix bug of q4_0 and q5_0 at iters==0 by @n1ck-guo in #658
- support vlm models for gguf format by @n1ck-guo in #654
- fix bug of block-wise quant imatrix by @n1ck-guo in #663
- fix gguf block-wise issue by @wenhuach21 in #664
- fix bugs of export deepseek gguf format when iters=0 and q3k accuracy by @n1ck-guo in #665
- handle zeros in imatrix by @wenhuach21 in #667
- fix ut issue by @WeiweiZhang1 in #668
- fix cuda hanging issue during packing by @WeiweiZhang1 in #669
- support to use lm_eval for vlm by @n1ck-guo in #670
- add trust remote code to gguf format load tokenizer by @n1ck-guo in #675
- fix 3bits asym accuracy and calib dataset issues by @WeiweiZhang1 in #674
- restrict accelerate version to reduce ram usage by @wenhuach21 in #673
- rm low_cpu when loading the model by @wenhuach21 in #676
- remove old vlm cuda ut by @WeiweiZhang1 in #678
- update gguf convert file and fix permute bug by @n1ck-guo in #679
- fix gguf regression for large models by @wenhuach21 in #680
- fix gemma vlm gguf regression by @wenhuach21 in #685
New Contributors
- @SinpackKonmakan made their first contribution in #590
- @xin3he made their first contribution in #646
Full Changelog: v0.5.1...v0.6.0
v0.5.1: bug fix release
What's Changed
- bump version into v0.5.0 by @XuehaoSun in #538
- fix triton multiple gpus and some other issues by @wenhuach21 in #539
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Highlights
- refine auto-round format inference: support 2, 3, 4, and 8 bits and the marlin kernel, and fix several bugs in the auto-round format
- support xpu in tuning and inference by @wenhuach21 in #481
- support for more vlms by @n1ck-guo in #390
- change quantization method name and make several refinements by @wenhuach21 in #500
- support rtn via iters==0 (see the sketch after this list) by @wenhuach21 in #510
- fix bug of mix calib dataset by @n1ck-guo in #492
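A minimal sketch of the iters==0 (RTN) path mentioned above; the model name is a placeholder and all other arguments keep their defaults.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# iters=0 skips the rounding/clipping optimization loop and falls back to
# plain round-to-nearest (RTN): much faster, usually somewhat less accurate.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=0)
autoround.quantize()
autoround.save_quantized("./opt-125m-w4g128-rtn", format="auto_round")
```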
What's Changed
- support xpu in tuning and inference by @wenhuach21 in #481
- add light ut, fix typos by @WeiweiZhang1 in #483
- bump into v0.4.7 by @XuehaoSun in #487
- fix dataset combine bug by @wenhuach21 in #489
- fix llama 8b time cost by @WeiweiZhang1 in #490
- update 2bits acc results by @WeiweiZhang1 in #491
- fix bug of mix calib dataset by @n1ck-guo in #492
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #494
- [GGUF support step3] patch for double quant by @n1ck-guo in #473
- refine inference backend/code step 1 by @wenhuach21 in #486
- refine inference step 2 by @wenhuach21 in #498
- change quantization method name and make several refinements by @wenhuach21 in #500
- fix bug of awq/gptq modules_to_not_convert by @n1ck-guo in #501
- use --tasks to control evaluation enabling by @wenhuach21 in #505
- fix gguf eval regression bug by @n1ck-guo in #506
- change to new api in readme by @wenhuach21 in #507
- fix setup issue on cuda machine by @wenhuach21 in #511
- support rtn via iters==0 by @wenhuach21 in #510
- fix critical bug of get_multimodal_block_names by @n1ck-guo in #509
- Update requirements-lib.txt by @yiliu30 in #513
- add group_size divisible check in backend by @wenhuach21 in #512
- support for more vlms by @n1ck-guo in #390
- move gguf-dq test to cuda by @n1ck-guo in #520
- fix bs!=1 for gemma and MiniMax-Text-01 by @wenhuach21 in #515
- add regex support in layer_config setting by @wenhuach21 in #519
- patch for vlm by @n1ck-guo in #518
- rename backend to packing_format in config.json by @wenhuach21 in #521
- fix example's model_dtype by @WeiweiZhang1 in #523
- rm fp16 export in autoround format by @wenhuach21 in #525
- update convert_hf_to_gguf to support more models by @n1ck-guo in #524
- fix light config by @WeiweiZhang1 in #526
- fix typos, add model card link for VLMs by @WeiweiZhang1 in #527
- add backend readme by @wenhuach21 in #528
- update mllm readme by @WeiweiZhang1 in #530
- fix bug of cuda ut by @n1ck-guo in #532
- fix inference issue by @wenhuach21 in #529
- update readme by @wenhuach21 in #531
- refine readme by @WeiweiZhang1 in #536
- fix cuda ut by @n1ck-guo in #537
Full Changelog: v0.4.7...v0.5.0
v0.4.7
Highlights
Support W4AFP8 for HPU; please refer to Intel Neural Compressor for guidance on running these models, by @yiliu30 in #467
Support packing immediately in the new quantization API to save RAM usage (see the sketch below) by @wenhuach21 in #466
20x for awq and 4x for gptq packing speedup on cuda by @wenhuach21 in #459
Support auto-round-light to speed up the tuning process by @WeiweiZhang1 in #454
Fix critical bug of mxfp4 in tuning by @wenhuach21 in #451
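A sketch of the combined quantize-and-export call; `quantize_and_save` is the name used in current auto-round documentation and is assumed here to be the "new quantization api" the highlight refers to.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
# Packing each block right after it is tuned, rather than packing the whole
# model in a separate pass at export time, is what keeps peak RAM lower.
autoround.quantize_and_save("./opt-125m-w4g128", format="auto_round")
```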
What's Changed
- step-1 support naive double quant in tuning by @wenhuach21 in #442
- fix critical bug of mxfp4 by @wenhuach21 in #451
- update readme by @wenhuach21 in #455
- update eval by @n1ck-guo in #450
- awq exporting bugfix by @WeiweiZhang1 in #456
- Support force loading into autoround Format by @WeiweiZhang1 in #453
- 20x for awq and 4x for gptq packing speedup by @wenhuach21 in #459
- fix eval bug by @n1ck-guo in #461
- [STEP-1] W4AFP8 export by @wenhuach21 in #378
- [HPU] Update W4A8 for HPU by @yiliu30 in #467
- support for gemma3 by @n1ck-guo in #468
- upload auto-round-light results by @WeiweiZhang1 in #454
- GGUF support step2: add naive Q2_KS and Q4_KS by @n1ck-guo in #448
- fix incorrect recipe data by @WeiweiZhang1 in #471
- support for mistral3 by @n1ck-guo in #472
- support to export gemma3 gguf format by @n1ck-guo in #470
- Increase unit test timeout from 120 to 240 minutes by @XuehaoSun in #474
- support packing immediately in new quantization api to save ram usage by @wenhuach21 in #466
- rm redundant line break by @WeiweiZhang1 in #475
- Temporarily close qxk api for new release by @n1ck-guo in #478
- add restrict for exporting act-quant models by @n1ck-guo in #480
Full Changelog: v0.4.6...v0.4.7
v0.4.6
Highlights:
1. Set torch compile to false by default in #447
2. Fix packing hang and force to fp16 at exporting in #430
3. Align auto_quantizer with Transformers 4.49 in #437
What's Changed
- Fix packing hang, torch compile and force to fp16 at exporting by @wenhuach21 in #430
- fix nblocks issues by @wenhuach21 in #432
- rm gc collect in packing by @wenhuach21 in #438
- align auto_quantizer with main branch in Transformers by @WeiweiZhang1 in #437
- [HPU]Fix compile bug when quant layer by @yiliu30 in #441
- remove tricky setting in mxfp4 by @wenhuach21 in #445
- fix bug of evaluate user model by @n1ck-guo in #444
- Refine funcs by @WeiweiZhang1 in #446
- set torch compile to false by default by @WeiweiZhang1 in #447
Full Changelog: v0.4.5...v0.4.6
v0.4.5
Highlights:
We have enhanced support for extremely large models with the following updates:
Multi-Card Tuning Support: Added basic support for multi-GPU tuning (#415); see the sketch below
Accelerated Packing Stage: Improved the packing speed (2x-4x) for AutoGPTQ and AutoAWQ formats by leveraging CUDA (#407)
Deepseek V3 GGUF Export: Introduced support for exporting models to the Deepseek V3 GGUF format (#416)
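A sketch of the multi-card path, assuming the AutoRound device argument accepts "auto" as referenced elsewhere in these notes; the model name is a placeholder (the feature targets much larger models).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder; the feature targets much larger models
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device="auto" spreads the tuning workload naively across the visible GPUs
# (the multi-card support added in #415); with one visible card it behaves
# like an ordinary single-GPU run.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, device="auto")
autoround.quantize()
autoround.save_quantized("./opt-125m-w4g128", format="auto_gptq")
```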
What's Changed
- update format readme by @wenhuach21 in #411
- fix log bug and device "auto" bug by @n1ck-guo in #409
- speedup packing stage for autogptq and autoawq format by @wenhuach21 in #407
- support naive multi-card tuning by @wenhuach21 in #415
- support bf16 inference for autoround format by @wenhuach21 in #420
- enable backup pile dataset loading by @WeiweiZhang1 in #417
- fix evaluation device bug, relate to issue 413 by @n1ck-guo in #419
- support to export deepseek v3 gguf format by @n1ck-guo in #416
- fix cuda UT torch_dtype by @WeiweiZhang1 in #423
- fix eval trust_remote_code by @n1ck-guo in #424
Full Changelog: v0.4.4...v0.4.5
v0.4.4 release
Highlights:
1. Fix install issue in #387
2. Support exporting gguf q4_0 and q4_1 formats in #393
3. Fix llm cmd line seqlen issue in #399
What's Changed
- fix a critical bug of static activation quantization by @wenhuach21 in #392
- vlm 70B+ in single card by @n1ck-guo in #395
- enhance calibration dataset and add awq pre quantization warning by @wenhuach21 in #396
- support awq format for vlms by @WeiweiZhang1 in #398
- [critical bug] fix llm example seqlen issue by @WeiweiZhang1 in #399
- fix device auto issue by @wenhuach21 in #400
- Fix auto-round install & bump into 0.4.4 by @XuehaoSun in #387
- fix dtype converting issue by @wenhuach21 in #403
- support for deepseek vl2 by @n1ck-guo in #401
- llm layer_config bugfix by @WeiweiZhang1 in #406
- support awq with qbits, only support sym by @wenhuach21 in #402
- support to export gguf q4_0 and q4_1 format by @n1ck-guo in #393
Full Changelog: v0.4.3...v0.4.4
v0.4.3: bug fix release
Highlights:
fix incorrect device setting in autoround format inference (see the sketch below) by @WeiweiZhang1 in #383
remove the dependency on AutoGPTQ by @XuehaoSun in #380
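For context, a sketch of loading an auto-round-format checkpoint for inference through transformers with no AutoGPTQ dependency; importing AutoRoundConfig registers the quantization backend, and the checkpoint path is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # noqa: F401  (import registers the auto-round backend)

quantized_dir = "./opt-125m-w4g128"  # placeholder: a model exported in auto-round format
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```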
What's Changed
- support llava hf vlm example by @WeiweiZhang1 in #381
- fix block_name_to_quantize by @WeiweiZhang1 in #382
- fix incorrect device setting in autoround format inference by @WeiweiZhang1 in #383
- refine homepage, update model links by @WeiweiZhang1 in #385
- update eval basic usage by @n1ck-guo in #384
- refine error msg and dump more log in the tuning by @wenhuach21 in #386
- remove the dependency on AutoGPTQ for CPU and bump to V0.4.3 by @XuehaoSun in #380
Full Changelog: v0.4.2...v0.4.3
v0.4.2: bug fix release
Highlights
1. Fix autoawq exporting issue
2. Remove bias exporting if possible in autogptq format
What's Changed
- bump version into v0.4.1 by @XuehaoSun in #350
- Update docker user and remove baseline UT by @XuehaoSun in #347
- delete llm example and refine readme by @wenhuach21 in #354
- Simulated W4Afp8 Quantization by @wenhuach21 in #331
- add QWQ-32B, VLM, Qwen2.5, Llama3.1 int4 models by @wenhuach21 in #356
- fix awq exporting by @wenhuach21 in #358
- Tensor reshape bugfix by @WeiweiZhang1 in #364
- fix awq backend and fp_layers issue by @wenhuach21 in #363
- fix awq exporting bugs by @wenhuach21 in #365
- fix bug of only_text_test check due to inference issue on cpu by @n1ck-guo in #362
- add gpu test by @wenhuach21 in #367
- using multicard when device set to "auto" by @n1ck-guo in #368
- quant_block_names enhancement by @WeiweiZhang1 in #369
- [HPU] Add lazy mode back by @yiliu30 in #371
- remove bias exporting if possible in autogptq format by @wenhuach21 in #375
- save processor automatically by @n1ck-guo in #372
- Add gpu ut by @wenhuach21 in #370
- fix gpu ut by @n1ck-guo in #376
- fix typos by @wenhuach21 in #377
Full Changelog: v0.4.1...v0.4.2
v0.4.1: bug fix release
Highlights:
- Fixed vllm calibration infinite loop issue
- Corrected the default value for the sym argument in the API configuration.
What's Changed
- fix typo by @wenhuach21 in #342
- vllm/llama-vision llava calibration infinite loop fix by @WeiweiZhang1 in #343
- [HPU] Enhance numba check by @yiliu30 in #345
- [VLM] fix bs and grad reset by @n1ck-guo in #344
- [HPU] Enhance installation check by @yiliu30 in #346
- [Critical Bug] use sym as default in the API by @wenhuach21 in #349
- triton backend requires triton < 3.0 by @wenhuach21 in #348
Full Changelog: v0.4...v0.4.1