
Conversation

yiliu30 (Contributor) commented Aug 25, 2025

Refer to vllm-project/compressed-tensors#400.

Bench code: https://gist.github.com/yiliu30/5535ac154cdd000d731a9fdd385b5df8
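
The gist itself is not reproduced here; the timings below were produced with a harness along these lines (a minimal sketch, not the gist's exact code; `pack_fp4_old` and `pack_fp4_new` are hypothetical stand-ins for the pre- and post-PR implementations):

```python
import time

import torch

def bench(fn, weight, iters=10):
    # Warm-up run (also triggers torch.compile, if used), then time repeats.
    fn(weight)
    if weight.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(weight)
    if weight.is_cuda:
        torch.cuda.synchronize()
    return time.perf_counter() - start

# One of the shapes from the results below.
weight = torch.randn(2048, 768)
# print(f"Old: {bench(pack_fp4_old, weight):.6f}s  New: {bench(pack_fp4_new, weight):.6f}s")
```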
Benchmark results

  • cpu
/home/yliu7/workspace/inc/3rd-party/torchao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_5 is deprecated and will be removed in torchao 0.14.0
  warnings.warn(self.msg)
Got model shape: {torch.Size([2048, 768]), torch.Size([768, 2048]), torch.Size([128, 2048]), torch.Size([512, 2048]), torch.Size([4096, 2048]), torch.Size([151936, 2048]), torch.Size([2048, 4096])}
Benchmarking packing for shape: torch.Size([2048, 768])
Old packing time: 20.764774 seconds
New packing time: 2.885279 seconds
Speedup: 7.20x
Benchmarking packing for shape: torch.Size([768, 2048])
Old packing time: 55.343031 seconds
New packing time: 0.369089 seconds
Speedup: 149.94x
Benchmarking packing for shape: torch.Size([128, 2048])
Old packing time: 7.780846 seconds
New packing time: 0.143778 seconds
Speedup: 54.12x
Benchmarking packing for shape: torch.Size([512, 2048])
Old packing time: 15.839768 seconds
New packing time: 3.432988 seconds
Speedup: 4.61x
Benchmarking packing for shape: torch.Size([4096, 2048])
Old packing time: 229.266429 seconds
New packing time: 8.651233 seconds
Speedup: 26.50x
Benchmarking packing for shape: torch.Size([151936, 2048])
Old packing time: 7915.168945 seconds
New packing time: 288.358398 seconds
Speedup: 27.45x
Benchmarking packing for shape: torch.Size([2048, 4096])
Old packing time: 232.400127 seconds
New packing time: 8.614367 seconds
Speedup: 26.98x
  • cuda
p packing_fp4.py 
/home/yliu7/workspace/inc/3rd-party/torchao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_5 is deprecated and will be removed in torchao 0.14.0
  warnings.warn(self.msg)
Got model shape: {torch.Size([2048, 768]), torch.Size([768, 2048]), torch.Size([128, 2048]), torch.Size([512, 2048]), torch.Size([4096, 2048]), torch.Size([151936, 2048]), torch.Size([2048, 4096])}
Benchmarking packing for shape: torch.Size([2048, 768])
Old packing time: 1.520101 seconds
New packing time: 0.042111 seconds
Speedup: 36.10x
Benchmarking packing for shape: torch.Size([768, 2048])
Old packing time: 1.518886 seconds
New packing time: 0.042597 seconds
Speedup: 35.66x
Benchmarking packing for shape: torch.Size([128, 2048])
Old packing time: 1.554410 seconds
New packing time: 0.043964 seconds
Speedup: 35.36x
Benchmarking packing for shape: torch.Size([512, 2048])
Old packing time: 1.626152 seconds
New packing time: 0.036053 seconds
Speedup: 45.10x
Benchmarking packing for shape: torch.Size([4096, 2048])
Old packing time: 3.776364 seconds
New packing time: 0.220520 seconds
Speedup: 17.12x
Benchmarking packing for shape: torch.Size([151936, 2048])
Old packing time: 122.678399 seconds
New packing time: 7.589126 seconds
Speedup: 16.17x
Benchmarking packing for shape: torch.Size([2048, 4096])
Old packing time: 3.770036 seconds
New packing time: 0.227194 seconds
Speedup: 16.59x

Signed-off-by: yiliu30 <[email protected]>
@yiliu30 yiliu30 requested a review from Copilot August 25, 2025 02:57
@yiliu30 yiliu30 marked this pull request as ready for review August 25, 2025 03:14
@yiliu30 yiliu30 requested a review from WeiweiZhang1 August 25, 2025 03:14


@yiliu30 yiliu30 requested a review from Copilot August 25, 2025 05:34
Copilot AI left a comment

Pull Request Overview

This PR optimizes FP4 packing performance by replacing a slow iterative loop with vectorized tensor operations and adding PyTorch compilation. The change significantly improves packing speed, achieving roughly 5-150x speedups on CPU and 16-45x on CUDA in the benchmarks above.

  • Replaces the iterative, loop-based nearest-index search with a vectorized torch.argmin operation (a hedged sketch follows this list)
  • Adds a @torch.compile decorator for further speedup
  • Converts an instance method into a standalone function for better compilation compatibility
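
The core of the change, as described above, is one vectorized argmin over the FP4 value table instead of a per-element Python loop. A minimal sketch of that pattern (the value-table ordering, nibble order, and function name here are illustrative assumptions, not the exact code in qlinear_fp.py):

```python
import torch

# Illustrative FP4 (E2M1) value table: codes 0-7 are positive magnitudes,
# codes 8-15 their negatives. The real code-to-value mapping lives in
# qlinear_fp.py and may order entries differently.
FP4_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

@torch.compile
def pack_fp4(weight: torch.Tensor) -> torch.Tensor:
    # One vectorized argmin over the 16-entry grid replaces the old
    # per-element Python loop that searched for the nearest FP4 value.
    grid = FP4_VALUES.to(device=weight.device, dtype=weight.dtype)
    idx = torch.argmin((weight.unsqueeze(-1) - grid).abs(), dim=-1).to(torch.uint8)
    # Pack two 4-bit codes per byte (assumes an even last dimension;
    # the low-nibble-first order is also just an assumption).
    idx = idx.reshape(*weight.shape[:-1], -1, 2)
    return idx[..., 0] | (idx[..., 1] << 4)
```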

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:
  • auto_round/export/export_to_autoround/qlinear_fp.py: refactors the FP4 packing function with vectorized operations and a torch.compile optimization
  • test/test_cuda/test_packing.py: adds a comprehensive test comparing the old and new packing implementations


@yiliu30 yiliu30 merged commit 3648639 into main Aug 25, 2025
3 checks passed
@yiliu30 yiliu30 deleted the speedup-packing branch August 25, 2025 06:36
wenhuach21 (Contributor) commented Aug 25, 2025

From my tests with the AutoRound format, torch.compile only provides a noticeable benefit for large models or MoE models, since compilation itself takes additional time. For a 7B model, packing without torch.compile typically takes only about 15 seconds, but with torch.compile it costs about 30 seconds.
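
If that tradeoff holds generally, one option (purely a sketch, not code from this PR; `pack_fp4_eager` and the threshold value are hypothetical) is to gate compilation on weight size so that small layers skip the one-time compile cost:

```python
import torch

# Hypothetical cutoff, to be tuned empirically: below it, the one-time
# compile cost outweighs the per-layer speedup described above.
COMPILE_NUMEL_THRESHOLD = 10_000_000

_compiled_pack = None

def pack_fp4_maybe_compiled(weight: torch.Tensor) -> torch.Tensor:
    # pack_fp4_eager is the uncompiled packing function (hypothetical name).
    global _compiled_pack
    if weight.numel() < COMPILE_NUMEL_THRESHOLD:
        return pack_fp4_eager(weight)
    if _compiled_pack is None:
        _compiled_pack = torch.compile(pack_fp4_eager)
    return _compiled_pack(weight)
```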

yiliu30 added a commit that referenced this pull request Aug 26, 2025
wenhuach21 pushed a commit that referenced this pull request Aug 26, 2025
@yiliu30 yiliu30 restored the speedup-packing branch August 26, 2025 05:48