Speedup FP4 packing #760
Conversation
Pull Request Overview
This PR optimizes FP4 packing by replacing a slow iterative loop with vectorized tensor operations and adding PyTorch compilation, achieving a 7-150x speedup on CPU and a 16-45x speedup on CUDA.
- Replaces iterative loop-based index finding with a vectorized `torch.argmin` operation (see the sketch after this list)
- Adds a `@torch.compile` decorator for further performance optimization
- Converts the instance method to a standalone function for better compilation compatibility
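For illustration, here is a minimal sketch of the vectorized approach the review describes; `FP4_VALUES`, `pack_fp4`, and the nibble layout are assumptions for this sketch, not the PR's actual identifiers:

```python
import torch

# Hypothetical E2M1 grid; the PR's actual lookup table lives in
# auto_round/export/export_to_autoround/qlinear_fp.py.
FP4_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

@torch.compile
def pack_fp4(x: torch.Tensor) -> torch.Tensor:
    """Quantize to the nearest FP4 grid value and pack two 4-bit
    indices per uint8. Assumes x.numel() is even."""
    grid = FP4_VALUES.to(device=x.device, dtype=x.dtype)
    # One vectorized argmin over the grid axis replaces the old
    # per-element Python loop.
    idx = torch.argmin((x.reshape(-1, 1) - grid).abs(), dim=-1).to(torch.uint8)
    pairs = idx.reshape(-1, 2)
    # Even elements in the low nibble, odd in the high nibble
    # (an assumed layout; the PR's packing order may differ).
    return pairs[:, 0] | (pairs[:, 1] << 4)
```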
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `auto_round/export/export_to_autoround/qlinear_fp.py` | Refactors the FP4 packing function with vectorized operations and `torch.compile` optimization |
| `test/test_cuda/test_packing.py` | Adds a comprehensive test comparing the old and new packing implementations |
From my test with the AutoRound format, torch.compile only provides noticeable benefits for large models or MoE models, since compilation takes additional time. For a 7B model, packing without torch.compile typically takes only about 15 seconds, but with torch.compile it costs about 30 seconds.
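A hedged sketch of how one might gate compilation on model size, per the observation above; `maybe_compile` and the threshold are illustrative only, not part of this PR:

```python
import torch

def maybe_compile(fn, num_params: int, threshold: int = 10_000_000_000):
    # Compilation itself costs tens of seconds, so only opt in when the
    # model is large enough (the 10B threshold here is an arbitrary
    # example) for the per-call speedup to amortize the one-time cost.
    return torch.compile(fn) if num_params > threshold else fn
```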
See vllm-project/compressed-tensors#400.
Bench code: https://gist.github.com/yiliu30/5535ac154cdd000d731a9fdd385b5df8
Benchmark results