add block-wise scaled int8 quantization based on QuantizedLayout mechanism #10864
base: master
Conversation
The changes for flux2 caused a slight conflict, could you take a look? Thanks!
Force-pushed from 597ab49 to 16c2dfa
Sure. I've resolved the conflicts. Thanks.
Force-pushed from 7f9a65c to 86c0361
```diff
      weight = weight_decompose(dora_scale, weight, lora_diff, alpha, strength, intermediate_dtype, function)
  else:
-     weight += function((strength * lora_diff).type(weight.dtype))
+     weight += function((strength * lora_diff).type(weight.dtype if not isinstance(weight, QuantizedTensor) else torch.float32))
```
I don't think these are needed in the latest commit.
It's a little tricky: weight.dtype on the quantized tensor returns torch.int8, so if we don't add this, it'll convert the LoRA diff to int.
Here's an example:
```python
import torch
# QuantizedTensor comes from this PR's quantization module

M, N = 256, 512
weight = torch.randn(M, N, dtype=torch.float32, device='cuda')
int8v = QuantizedTensor.from_float(weight, layout_type='BlockWiseINT8Layout', is_weight=True)
fp8v = QuantizedTensor.from_float(weight, layout_type='TensorCoreFP8Layout')
print(int8v.dtype, fp8v.dtype)
# output:
# torch.int8 torch.float8_e4m3fn
```
So code that directly accesses dtype might cause some potential issues.
Maybe we should try to find a way to override Tensor.dtype?
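For reference, one known way to "override" the reported dtype (the pattern used by torchao-style tensor subclasses) is to build the quantized tensor as a wrapper subclass via torch.Tensor._make_wrapper_subclass, which decouples the advertised dtype from the stored INT8 payload. A minimal, hedged sketch with illustrative names (not this PR's implementation):

```python
import torch
from torch.utils._pytree import tree_map

class BlockInt8Tensor(torch.Tensor):
    """Illustrative wrapper that stores int8 data + a scale but reports a float dtype."""

    @staticmethod
    def __new__(cls, int_data, scale, orig_dtype):
        # The wrapper advertises orig_dtype (e.g. torch.float32) even though the
        # payload is torch.int8, so code that reads weight.dtype keeps working.
        return torch.Tensor._make_wrapper_subclass(
            cls, int_data.shape, dtype=orig_dtype, device=int_data.device
        )

    def __init__(self, int_data, scale, orig_dtype):
        self._int_data = int_data  # int8 payload
        self._scale = scale        # per-tensor scale here, for simplicity

    def dequantize(self):
        return self._int_data.to(self._scale.dtype) * self._scale

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Naive fallback: dequantize any wrapped arguments and run the op in float.
        unwrap = lambda t: t.dequantize() if isinstance(t, cls) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```

With something like this, weight.dtype would return the original float dtype, so the (strength * lora_diff).type(weight.dtype) cast above would no longer silently convert the diff to int8.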
In the most recent code the lora stuff is called after doing QuantizedTensor.dequantize
Basically it does: convert_weight() -> apply the lora -> set_weight()
The convert and set weight functions are: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ops.py#L502
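In pseudocode, that flow looks roughly like this (a hedged sketch using the QuantizedTensor calls mentioned in this thread, not the actual ops.py code):

```python
def patch_quantized_weight(qweight, lora_diff, strength):
    # convert_weight(): dequantize to a regular float tensor
    w = qweight.dequantize()
    # apply the LoRA patch in floating point
    w = w + strength * lora_diff.to(w.dtype)
    # set_weight(): re-quantize back into the block-wise INT8 layout
    return QuantizedTensor.from_float(w, layout_type='BlockWiseINT8Layout', is_weight=True)
```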
I'll check. Thanks.
…anism
- add more tests by comparing with manual torch implementation
- add perf benchmarks
- fix errors caused by merging
- default no output quant
- fix unittest
Force-pushed from 86c0361 to 3322d21
Add Block-wise INT8 Quantization
This PR adds DeepSeek-style block-wise INT8 quantization support to ComfyUI, enabling ~50% memory reduction with limited accuracy loss and improved performance on large layers.
How it works:
Similar to our current scaled FP8, but the scale is block-based: the tensor is split into blocks and a scale value is saved for each block.
Note that the implementation is based on this, which is asymmetric between activations and weights: activations are only split into blocks along the last dimension, while weights are split across the last two dimensions (see the sketch below).
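As a rough illustration of the asymmetric blocking (a hedged sketch, not the PR's code; a block size of 128 is assumed, matching the DeepSeek-style scheme):

```python
import torch

BLOCK = 128  # assumed block size

def quantize_weight_blockwise(w: torch.Tensor, block: int = BLOCK):
    """One scale per (block x block) tile of a 2D weight."""
    M, N = w.shape
    assert M % block == 0 and N % block == 0, "sketch assumes divisible shapes"
    tiles = w.reshape(M // block, block, N // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / 127.0  # map each tile's max-abs onto the int8 range
    q = torch.round(tiles / scale).clamp(-127, 127).to(torch.int8)
    return q.reshape(M, N), scale.reshape(M // block, N // block)

def quantize_act_blockwise(x: torch.Tensor, block: int = BLOCK):
    """One scale per 1D block along the last dimension of an activation."""
    *lead, N = x.shape
    assert N % block == 0, "sketch assumes divisible shapes"
    blocks = x.reshape(*lead, N // block, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / 127.0
    q = torch.round(blocks / scale).clamp(-127, 127).to(torch.int8)
    return q.reshape(*lead, N), scale.squeeze(-1)

def dequantize_weight_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = BLOCK):
    M, N = q.shape
    tiles = q.reshape(M // block, block, N // block, block).to(torch.float32)
    return (tiles * scale.reshape(M // block, 1, N // block, 1)).reshape(M, N)
```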
More papers/code for reference:
- Jetfire
- Jetfire Repo
- Deepseek v3 paper
- Deepseek block-wise scaled fp8 implementation
- Deepseek Deepgemm
Changes:
Core Implementation
- BlockWiseINT8Layout: new quantization format with per-block scaling, based on the QuantizedLayout mechanism (here)
Tests
- tests-unit/comfy_quant/test_quant_registry.py: verifies quantization/dequantization errors and the errors of all ops (gemm, gelu, etc.) against a manual torch implementation, and also adds runtime benchmarking (disabled by default); a sketch of this style of check follows below
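A hedged sketch of that style of check (shapes and tolerance are illustrative; QuantizedTensor.from_float/dequantize as used elsewhere in this PR):

```python
import torch

def test_blockwise_int8_roundtrip_error():
    w = torch.randn(256, 512, dtype=torch.float32, device='cuda')
    qw = QuantizedTensor.from_float(w, layout_type='BlockWiseINT8Layout', is_weight=True)
    w_hat = qw.dequantize()
    rel_err = (w_hat - w).norm() / w.norm()
    # loose, illustrative bound; the real tests compare against a manual torch implementation
    assert rel_err < 5e-2, f"block-wise INT8 round-trip error too high: {rel_err:.4f}"
```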
Performance Benchmarks (RTX 4090)
More detailed perf benchmark, precision comparison, and memory consumption comparison on Wan video model sizes:
Conclusion: INT8 is slower than FP8 but faster than FP16; precision is better than FP8; memory consumption is similar to FP8.
Usage
Actual ComfyUI workflow test:
I've uploaded some quantized Wan2.2 models here, and created this sample workflow.
Generation result:
https://github.com/user-attachments/assets/35227283-f8b6-4b7c-af18-6d86e6ed18f6
Open Issue:
LightX2V LoRA not working
I tested loading a LoRA on top of the quantized model in this workflow, and it doesn't work. I did a lot of debugging, comparing the inputs and outputs of every layer, and they all look good.
So my current guess is that this kind of block-wise INT8 quantization has higher error than the original scaled FP8 for some data distributions.
I did some more tests on the quantization/dequantization errors for model+LoRA here: compare_lora_error.ipynb
From the tests, it seems that for the original Wan model the new INT8 quantization's error is smaller, but if we load the LoRA and then quantize (the actual form it takes when running), the error becomes larger than with scaled FP8.
UPDATES:
I've added more unit tests to make sure it's not some bug; it's just the error accumulation from two rounds of quantization/de-quantization.
I've tested merging the LoRA into the original model first, then quantizing to the INT8 format, then loading it in a Comfy workflow, and it works. The "Wan2.2 T2V with LightX2V LoRA merged in then quantized as INT8" models can be found here: low noise and high noise. So: with two rounds of quantization the error would be too high, but a single quantization works.
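To illustrate the error-accumulation point, here is a hedged, self-contained sketch (a simplified per-tile INT8 round trip and a random stand-in for the LoRA diff; not the notebook's code) comparing quantize-once vs quantize-patch-requantize:

```python
import torch

def int8_roundtrip(w, block=128):
    # Simplified block-wise int8 quantize + dequantize round trip.
    M, N = w.shape
    tiles = w.reshape(M // block, block, N // block, block)
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(tiles / scale).clamp(-127, 127)
    return (q * scale).reshape(M, N)

torch.manual_seed(0)
W = torch.randn(1024, 1024)
delta = 0.05 * torch.randn(1024, 1024)  # stand-in for a merged LoRA diff

one_pass = int8_roundtrip(W + delta)                   # merge LoRA first, quantize once
two_pass = int8_roundtrip(int8_roundtrip(W) + delta)   # quantize, patch, re-quantize

ref = W + delta
print("one-pass rel. error:", ((one_pass - ref).norm() / ref.norm()).item())
print("two-pass rel. error:", ((two_pass - ref).norm() / ref.norm()).item())
```

This mirrors the observation above: merging the LoRA before a single quantization avoids the second round-trip error.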
Feedbacks/Suggestions are welcome!
Tool to convert a model to the block-wise scaled INT8 format:
convert_to_int8_blockwise.py