
Conversation

lyogavin commented Nov 24, 2025

Add Block-wise INT8 Quantization

This PR adds DeepSeek-style block-wise INT8 quantization support to ComfyUI, enabling roughly 50% memory reduction with limited accuracy loss and improved performance on large layers.

How it works:

Similar to our current scaled FP8, but the scale is block-based: the given tensor is split into blocks and a scale value is saved for each block.

Note that the implementation is based on this, which is asymmetric between activations and weights: activations are only split into blocks along the last dimension, while weights are blocked across the last two dimensions.
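
For intuition, here is a minimal PyTorch sketch of the blocking scheme described above (illustrative only, not the PR's kernel code; the function names are made up and dimensions are assumed to divide evenly by block_size):

import torch

def quantize_weight_blockwise_int8(w, block_size=128):
    # Weights: one absmax scale per (block_size x block_size) block across the last 2 dims.
    rows, cols = w.shape
    blocks = w.reshape(rows // block_size, block_size, cols // block_size, block_size)
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-8) / 127.0
    q = (blocks / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(rows, cols), scale.reshape(rows // block_size, cols // block_size)

def quantize_activation_blockwise_int8(x, block_size=128):
    # Activations: one absmax scale per block of block_size along the last dim only.
    *lead, cols = x.shape
    blocks = x.reshape(*lead, cols // block_size, block_size)
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (blocks / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(*lead, cols), scale.squeeze(-1)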

More papers/code for reference:
Jetfire
Jetfire Repo
Deepseek v3 paper
Deepseek block-wise scaled fp8 implementation
Deepseek Deepgemm

Changes:

Core Implementation

  • BlockWiseINT8Layout: new quantization format with per-block scaling, built on the QuantizedLayout mechanism (here)
  • Triton-optimized CUDA kernels with PyTorch fallbacks (referencing this implementation); a rough sketch of the fallback path is shown after this list
  • Configurable block size (default: 128)
  • Necessary changes in the weight-adapter code that references the internal data dtype
  • Changed the MixedPrecisionOps implementation so it loads the new field (is_weight) for the new layout
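
As a rough mental model of what the PyTorch fallback has to compute, here is a dequantize-then-matmul reference sketch (assumes the scale shapes from the sketch above; the actual Triton kernels instead run the INT8 matmul per block and rescale the accumulator, so this is not the PR's code):

import torch

def blockwise_int8_linear_reference(x_q, x_scale, w_q, w_scale, block_size=128, bias=None):
    # x_q: (..., in_features) int8, x_scale: (..., in_features // block_size)
    # w_q: (out_features, in_features) int8, w_scale: (out_features // block_size, in_features // block_size)
    x = (x_q.float().reshape(*x_q.shape[:-1], -1, block_size) * x_scale.unsqueeze(-1)).reshape(x_q.shape)
    rows, cols = w_q.shape
    w = (w_q.float().reshape(rows // block_size, block_size, cols // block_size, block_size)
         * w_scale[:, None, :, None]).reshape(rows, cols)
    return torch.nn.functional.linear(x, w, bias)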

Tests

  • Added unit tests in tests-unit/comfy_quant/test_quant_registry.py: they verify quantization/dequantization errors and the errors of all ops (gemm, gelu, etc.), and also add runtime benchmarking (disabled by default); a minimal sketch of this kind of check follows this list
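
For illustration, a minimal sketch of the kind of round-trip check those tests perform (not the actual test code; it uses the QuantizedTensor API shown under Usage below, and the error bound is an arbitrary illustrative value):

import torch
from comfy.quant_ops import QuantizedTensor

def test_blockwise_int8_roundtrip():
    w = torch.randn(256, 512)
    q = QuantizedTensor.from_float(w, "BlockWiseINT8Layout", block_size=128, is_weight=True)
    w_hat = q.dequantize()
    rel_err = (w - w_hat).abs().mean() / w.abs().mean()
    assert rel_err < 0.05  # loose illustrative bound; the real tests compare against a manual torch implementation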

Performance Benchmarks (RTX 4090)

  • Memory: ~50% reduction vs FP16
  • Speed (Wan T2V models, times in seconds):
    • 5B (RTX 4090 48GB VRAM):
      • Block-wise scaled INT8 with linear+gelu+transpose: 168.05
      • Block-wise scaled INT8 with linear+gelu: 167.12
      • Block-wise scaled INT8 with linear: 171.33
      • FP16: 178.77
    • 14B (RTX 4090 48GB VRAM, 20 steps):
      • Block-wise scaled INT8 with linear+gelu+transpose: 233.23
      • Block-wise scaled INT8 with linear+gelu: 233.86
      • Block-wise scaled INT8 with linear: 236.90
      • Scaled FP8: 209.39
      • FP16: 253.19
    • 14B (RTX 4090 24GB VRAM, 5 steps with step distillation):
      • INT8: 85.72 sec (no offload)
      • FP16: 106.9 sec (10242MB offloaded)
      • FP8: 80 sec (no offload)

A more detailed performance, precision, and memory-consumption comparison across Wan video model sizes:

================================================================================
Summary: FP16 vs INT8 vs FP8 Performance
================================================================================

WAN2.2-5B:
Layer                             FP16        INT8        FP8         Speedup                   Mem
--------------------------------------------------------------------------------
First layer (small batch)         0.146ms     0.268ms     0.117ms     INT8: 0.54x  FP8: 1.25x   2.00x
Attention layer (long seq)        6.331ms     6.549ms     5.519ms     INT8: 0.97x  FP8: 1.15x   1.94x
MLP down projection (long seq)    30.536ms    23.795ms    18.422ms    INT8: 1.28x  FP8: 1.66x   1.94x
Attention layer (medium seq)      0.149ms     0.246ms     0.160ms     INT8: 0.61x  FP8: 0.93x   1.98x
--------------------------------------------------------------------------------
SUBTOTAL                          37.162ms    30.857ms    24.218ms    INT8: 1.20x  FP8: 1.53x
  WAN2.2-5B avg memory reduction: 1.97x
  WAN2.2-5B avg INT8 precision error: 0.179672
  WAN2.2-5B avg FP8 precision error: 0.389072
  WAN2.2-5B VRAM usage: FP16 6138.19MB, INT8 7189.63MB (during inference with both)

WAN2.2-14B:
Layer                             FP16        INT8        FP8         Speedup                   Mem
--------------------------------------------------------------------------------
First layer (small batch)         0.360ms     0.395ms     0.268ms     INT8: 0.91x  FP8: 1.34x   2.00x
Attention layer (long seq)        17.401ms    15.633ms    12.488ms    INT8: 1.11x  FP8: 1.39x   1.94x
Attention layer (medium seq)      0.366ms     0.357ms     0.262ms     INT8: 1.02x  FP8: 1.40x   1.99x
--------------------------------------------------------------------------------
SUBTOTAL                          18.127ms    16.385ms    13.018ms    INT8: 1.11x  FP8: 1.39x
  WAN2.2-14B avg memory reduction: 1.98x
  WAN2.2-14B avg INT8 precision error: 0.190389
  WAN2.2-14B avg FP8 precision error: 0.365195
  WAN2.2-14B VRAM usage: FP16 2829.11MB, INT8 3310.05MB (during inference with both)

Conclusion: INT8 is slower than FP8 but faster than FP16, its precision is better than FP8's, and its memory consumption is similar to FP8's.

Usage

import torch
from comfy.quant_ops import QuantizedTensor

weight = torch.randn(512, 256)  # example full-precision weight (out_features, in_features)
input = torch.randn(4, 256)     # example activation

weight_int8 = QuantizedTensor.from_float(
    weight,
    "BlockWiseINT8Layout",
    block_size=128,
    is_weight=True
)

# to dequantize:
weight_float = weight_int8.dequantize()

# the call below internally dispatches to the INT8 block-wise linear operation:
output = torch.nn.functional.linear(input, weight_int8)

Actual ComfyUI workflow test:

I've uploaded some quantized Wan2.2 models here and created this sample workflow.

Generation result:
https://github.com/user-attachments/assets/35227283-f8b6-4b7c-af18-6d86e6ed18f6

Open Issue:

Lightx2v LoRA not working

I tested loading a LoRA on top of the quantized model in this workflow, and it doesn't work. I did a lot of debugging and compared the inputs and outputs of every layer; they all look fine.

My current guess is that this kind of block-wise INT8 quantization has higher error than the original scaled FP8 for some data distributions.

I ran some more tests on the quantization/dequantization errors of model+LoRA here: compare_lora_error.ipynb

From those tests, it seems that for the original Wan model the new INT8 quantization's error is smaller, but if we load the LoRA and then quantize (the actual form it takes at runtime), the error becomes larger than with scaled FP8.

UPDATES:
I've added more unit tests to confirm this isn't a bug; it's just error accumulation from two rounds of quantization/dequantization.

I also tested merging the LoRA into the original model first, then quantizing to the INT8 format, then loading it in a Comfy workflow, and that works. The "Wan2.2 T2V with LightX2V LoRA merged in, then quantized as INT8" models can be found here: low noise and high noise. In short: two rounds of quantization introduce too much error, while a single quantization works.
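
For reference, a rough sketch of how the single- vs. double-quantization error can be measured with the API from the Usage section (the tensors are random stand-ins, not real model or LoRA weights):

import torch
from comfy.quant_ops import QuantizedTensor

def roundtrip(w):
    return QuantizedTensor.from_float(w, "BlockWiseINT8Layout", block_size=128, is_weight=True).dequantize()

base = torch.randn(4096, 4096)               # stand-in for a base model weight
lora_delta = 0.05 * torch.randn(4096, 4096)  # stand-in for strength * (lora_up @ lora_down)
target = base + lora_delta

merged_then_quantized = roundtrip(target)                             # one quantization: merge the LoRA first
patched_after_quantization = roundtrip(roundtrip(base) + lora_delta)  # two quantizations: patch an already-quantized model

print("merge-first error:", (merged_then_quantized - target).abs().mean().item())
print("patch-after-quantization error:", (patched_after_quantization - target).abs().mean().item())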

Feedback/suggestions are welcome!

Tool to convert a model to the block-wise scaled INT8 format:

convert_to_int8_blockwise.py

Kosinkadink (Collaborator) commented Nov 25, 2025

The changes for flux2 caused a slight conflict, could you take a look? Thanks!

lyogavin force-pushed the support_int8_quantization branch 2 times, most recently from 597ab49 to 16c2dfa on November 25, 2025 at 21:06
lyogavin (Author) commented:

> The changes for flux2 caused a slight conflict, could you take a look? Thanks!

Sure. I've resolved the conflicts. Thanks.

lyogavin force-pushed the support_int8_quantization branch 6 times, most recently from 7f9a65c to 86c0361 on November 26, 2025 at 18:41
                weight = weight_decompose(dora_scale, weight, lora_diff, alpha, strength, intermediate_dtype, function)
            else:
-               weight += function((strength * lora_diff).type(weight.dtype))
+               weight += function((strength * lora_diff).type(weight.dtype if not isinstance(weight, QuantizedTensor) else torch.float32))
Owner:

I don't think these are needed in the latest commit.

Author:

It's a little tricky: weight.dtype returns torch.int8 for the quantized tensor, so if we don't add this, it converts the LoRA diff to int8.

Here's an example:

import torch
from comfy.quant_ops import QuantizedTensor

M, N = 256, 512
weight = torch.randn(M, N, dtype=torch.float32, device='cuda')

int8v = QuantizedTensor.from_float(weight, layout_type='BlockWiseINT8Layout', is_weight=True)
fp8v = QuantizedTensor.from_float(weight, layout_type='TensorCoreFP8Layout')

print(int8v.dtype, fp8v.dtype)
# output:
# torch.int8 torch.float8_e4m3fn

So code that directly accesses dtype might cause potential issues.

Maybe we should try to find a way to override Tensor.dtype?

Owner:

In the most recent code, the LoRA logic runs after QuantizedTensor.dequantize is called.

Owner:

Basically it does: convert_weight() -> apply the LoRA -> set_weight()

The convert and set weight functions are: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ops.py#L502
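
Roughly, the flow being described, as pseudocode (apply_lora_patches is a hypothetical placeholder for the LoRA math; the real convert_weight/set_weight code is in the linked comfy/ops.py):

weight = convert_weight(module)      # dequantize the QuantizedTensor into a plain float tensor
weight = apply_lora_patches(weight)  # LoRA math runs on the float tensor, so weight.dtype is a float dtype here
set_weight(module, weight)           # store the patched weight back (re-quantizing for quantized layers)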

Author:

I'll check. Thanks.

…anism

add more tests by comparing with manual torch implementation

add perf benchmarks

fix errors caused by merging

default no output quant

fix unittest