@brian-dellabetta brian-dellabetta commented Aug 21, 2025

To support multi-modifier recipes (e.g. AWQ+W4A16 on self_attn layers and FP8_DYNAMIC on mlp layers), quantization config and status must be applied only to the modules scoped to each modifier, not to every module at once. This PR updates apply_quantization_config so that quantization_config and quantization_status are set only on the target modules rather than changed globally across the model.
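The scoping idea can be pictured with a minimal, library-free sketch. Everything here is hypothetical and for illustration only: `FakeModule`, `is_match`, and `apply_scoped` are not part of compressed-tensors, the `re:` prefix for regex targets is an assumed matching convention, and the status string is a placeholder. The point is only that each call touches the modules matching its own targets and leaves the rest alone.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeModule:
    """Hypothetical stand-in for a model submodule (not a torch.nn.Module)."""
    kind: str
    quantization_scheme: Optional[str] = None
    quantization_status: Optional[str] = None


def is_match(name, module, patterns):
    """Assumed rule: 're:<regex>' matches the module name by regex;
    any other pattern must equal the module name or its kind."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern in (name, module.kind):
            return True
    return False


def apply_scoped(model, scheme, targets, ignore=()):
    """Set scheme and status only on modules that match `targets`
    and do not match `ignore`; return the names that were touched."""
    touched = []
    for name, module in model.items():
        if is_match(name, module, ignore):
            continue
        if not is_match(name, module, targets):
            continue
        module.quantization_scheme = scheme
        module.quantization_status = "initialized"  # placeholder status value
        touched.append(name)
    return touched


model = {
    "layers.0.self_attn.q_proj": FakeModule("Linear"),
    "layers.0.mlp.down_proj": FakeModule("Linear"),
    "lm_head": FakeModule("Linear"),
}

# Two configs applied in series, each scoped to its own targets.
attn_names = apply_scoped(model, "W4A16", ["re:.*self_attn.*"], ignore=["lm_head"])
mlp_names = apply_scoped(model, "FP8_DYNAMIC", ["re:.*mlp.*"], ignore=["lm_head"])
```

In this sketch the second call does not disturb the scheme set by the first, and `lm_head` is never touched, which is the behavior a multi-modifier recipe relies on.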

For target prioritization to work properly, apply_quantization_status is performed regardless of the model's current status. Without these changes, test_target_prioritization fails.
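One way to read the unconditional re-application is sketched below. This is an illustration under assumptions, not the library's code: the contents of `ALL_QPARAM_NAMES` are guessed (only the constant's name appears in this PR), `FakeModule` and `initialize_for_scheme` are hypothetical, and schemes are reduced to two booleans. The sketch re-initializes a module without checking its current status and clears any stale quantization parameters first.

```python
# Assumed parameter names; the real constant in compressed-tensors may differ.
ALL_QPARAM_NAMES = (
    "weight_scale",
    "weight_zero_point",
    "input_scale",
    "input_zero_point",
)


class FakeModule:
    """Hypothetical stand-in for a module that can carry attributes."""


def initialize_for_scheme(module, scheme):
    """Initialize quantization state without consulting the current status."""
    # Clear parameters left over from an earlier application.
    for name in ALL_QPARAM_NAMES:
        if hasattr(module, name):
            delattr(module, name)
    # Register only what the new scheme needs (placeholder values).
    if scheme.get("weights"):
        module.weight_scale = 1.0
        module.weight_zero_point = 0
    if scheme.get("input_activations"):
        module.input_scale = 1.0
        module.input_zero_point = 0
    module.quantization_scheme = scheme
    module.quantization_status = "initialized"  # placeholder status value


layer = FakeModule()

# First application quantizes weights and input activations.
initialize_for_scheme(layer, {"weights": True, "input_activations": True})
had_input_scale = hasattr(layer, "input_scale")

# A later, higher-priority application is weight-only; the stale
# input parameters are removed rather than left behind.
initialize_for_scheme(layer, {"weights": True, "input_activations": False})
```

If initialization were skipped whenever the module already had a status, the second scheme would never take effect in this sketch and `input_scale` would remain on the module.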

Other small changes:

  • Added test_multi_apply_quantization_config to verify that applying multiple quantization configs in series works correctly: shapes are correct and unused parameters are removed.
  • Dropped override_quantization_status in favor of the more general patch_attr.
  • Removed infer_quantization_status, which is no longer meaningful at the model level. It is also no longer needed because a module's current status isn't checked.
  • Added an ALL_QPARAM_NAMES constant so that parameters related to quantization can be cleared from modules during init.
  • Removed all references to "quant_method": "sparseml" in favor of "compressed-tensors".
  • Dropped usage of compress_quantized_weights and apply_quantization_status. We can remove compress_quantized_weights and references to it in examples/notebooks in a follow-up PR.
  • Also updated tests to get rid of warnings such as:
tests/test_compressors/quantized_compressors/test_fp8_quant.py::test_quant_format[channel-None-sc2-zp2]
  /home/runner/work/compressed-tensors/compressed-tensors/tests/test_compressors/quantized_compressors/test_fp8_quant.py:78: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).
    "dummy.weight_scale": torch.tensor(sc, dtype=torch.float32),
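As a rough picture of why a general attribute patcher can replace a status-specific override helper, here is a minimal sketch of a temporary-attribute context manager built on the standard library. `patch_attr_sketch` and `FakeModule` are hypothetical names written for this illustration; the real patch_attr in compressed-tensors may differ in signature and behavior.

```python
from contextlib import contextmanager

_MISSING = object()  # sentinel: the attribute did not exist before patching


@contextmanager
def patch_attr_sketch(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the prior state on exit."""
    original = getattr(obj, name, _MISSING)
    setattr(obj, name, value)
    try:
        yield
    finally:
        if original is _MISSING:
            delattr(obj, name)
        else:
            setattr(obj, name, original)


class FakeModule:
    """Hypothetical stand-in for a quantized module."""
    quantization_status = "calibration"  # placeholder status value


module = FakeModule()

with patch_attr_sketch(module, "quantization_status", "frozen"):
    status_inside = module.quantization_status

status_after = module.quantization_status
```

Because the helper works on any attribute name, the same mechanism covers status overrides and any other temporary change, and the `finally` clause restores the original value even if the body raises.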

Merge in conjunction with

@brian-dellabetta brian-dellabetta changed the title from "[Mulit-Modifier] Scoped apply quantization config" to "[Multi-Modifier] Scoped apply quantization config" on Aug 21, 2025
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch 2 times, most recently from 03fb664 to 550c0ad, on August 21, 2025 19:18
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch from f70aedb to 606f177 on August 21, 2025 19:38
kylesayrs commented Aug 25, 2025

FYI: #428 also touches some of the apply logic and adds more scheme merging.

@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch from 24af65a to 8259cbb on August 28, 2025 17:00
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch from 8259cbb to b515c1b on August 28, 2025 17:00
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch from 92f8757 to d2903a1 on September 15, 2025 17:20
kylesayrs previously approved these changes Sep 15, 2025

LGTM!

rahul-tuli previously approved these changes Sep 16, 2025

Good job! LGTM! 🚀

@brian-dellabetta brian-dellabetta force-pushed the bdellabe/scoped-quant-status branch from f7239b1 to 01af659 on September 18, 2025 16:48
kylesayrs previously approved these changes Sep 18, 2025
@dsikka dsikka merged commit dfd069b into main Sep 18, 2025
2 checks passed
@dsikka dsikka deleted the bdellabe/scoped-quant-status branch September 18, 2025 19:10
brian-dellabetta added a commit to vllm-project/llm-compressor that referenced this pull request Sep 22, 2025
…atus (#1772)

SUMMARY:
Prerequisites:
* vllm-project/compressed-tensors#432

This allows for multi-modifier support by scoping the application of
quantization config/status to only the modules in the model that match
the given targets/ignore configuration, rather than all modules.
Initialization of observers is moved to on_start (instead of
on_initialize) so that it pairs with their removal in on_end (rather
than on_finalize). This prevents collisions between modifiers during the
multi-modifier lifecycle.
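
The lifecycle pairing described above can be pictured with a small, library-free sketch. All names here (`SketchModifier`, the `observers` dict, `active_owners`) are hypothetical and do not reflect llm-compressor's actual classes; the sketch assumes a sequential run in which each modifier starts and ends before the next one starts.

```python
class SketchModifier:
    """Hypothetical modifier with the four lifecycle hooks named above."""

    def __init__(self, name, targets):
        self.name = name
        self.targets = targets
        self.initialized = False

    def on_initialize(self, observers):
        # Only bookkeeping happens here; no observers are attached yet.
        self.initialized = True

    def on_start(self, observers):
        # Observers are attached when this modifier's calibration starts.
        for target in self.targets:
            observers.setdefault(target, set()).add(self.name)

    def on_end(self, observers):
        # ...and removed as soon as this modifier's calibration ends.
        for target in self.targets:
            observers[target].discard(self.name)
            if not observers[target]:
                del observers[target]

    def on_finalize(self, observers):
        self.initialized = False


def active_owners(observers):
    """Names of the modifiers that currently have observers attached."""
    return {owner for owners in observers.values() for owner in owners}


observers = {}  # module name -> set of modifier names observing it
modifiers = [
    SketchModifier("AWQ", ["self_attn.q_proj"]),
    SketchModifier("GPTQ", ["mlp.down_proj"]),
]

for modifier in modifiers:
    modifier.on_initialize(observers)
owners_after_initialize = active_owners(observers)

owners_per_stage = []
for modifier in modifiers:
    modifier.on_start(observers)
    owners_per_stage.append(active_owners(observers))
    modifier.on_end(observers)

for modifier in modifiers:
    modifier.on_finalize(observers)
```

In this sketch no observers exist after initialization, and at each stage only the running modifier's observers are attached, so two modifiers never have observers live at the same time.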

- [x] Update AWQ
- [x] Update QuantizationModifier
- [x] Update QuantizationMixin
- [x] Update GPTQ
- [x] No other quantization modifiers exist


TEST PLAN:
- Tests were added to
vllm-project/compressed-tensors#432 to confirm
correct application of multiple modifiers.
- Added an example in this PR to show how AWQ and GPTQ can be applied
heterogeneously to a model, along with a small README. Logs show
alternating AWQ and GPTQ messages for `"sequential"`, and correct
behavior for `"independent"` pipelines. [Model
checkpoint](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-selfattn-w8a8-mlp-w4a16-sequential/tree/main)
for the sequential pipeline shows correct application of W8A8 to
self_attn layers and W4A16 to mlp layers. config.json and safetensors
weights all look as expected.

---------

Signed-off-by: Brian Dellabetta <[email protected]>