Speedup FP4 packing #760
Conversation
Pull Request Overview
This PR optimizes FP4 packing by replacing a slow iterative loop with vectorized tensor operations and adding PyTorch compilation, achieving a 7-150x speedup on CPU and a 16-45x speedup on CUDA.
- Replaces iterative loop-based index finding with a vectorized `torch.argmin` operation (see the sketch after this list)
- Adds a `@torch.compile` decorator for further performance optimization
- Converts the instance method to a standalone function for better compilation compatibility
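For illustration, here is a minimal sketch of the vectorized approach the review describes; `FP4_VALUES`, `pack_fp4`, and the nibble layout are assumptions for this sketch, not the PR's actual identifiers:

```python
import torch

# Hypothetical E2M1 grid; the PR's actual lookup table lives in
# auto_round/export/export_to_autoround/qlinear_fp.py.
FP4_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

@torch.compile
def pack_fp4(x: torch.Tensor) -> torch.Tensor:
    """Quantize to the nearest FP4 grid value and pack two 4-bit
    indices per uint8. Assumes x.numel() is even."""
    grid = FP4_VALUES.to(device=x.device, dtype=x.dtype)
    # One vectorized argmin over the grid axis replaces the old
    # per-element Python loop.
    idx = torch.argmin((x.reshape(-1, 1) - grid).abs(), dim=-1).to(torch.uint8)
    pairs = idx.reshape(-1, 2)
    # Even elements in the low nibble, odd in the high nibble
    # (an assumed layout; the PR's packing order may differ).
    return pairs[:, 0] | (pairs[:, 1] << 4)
```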
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `auto_round/export/export_to_autoround/qlinear_fp.py` | Refactors the FP4 packing function with vectorized operations and `torch.compile` optimization |
| `test/test_cuda/test_packing.py` | Adds a comprehensive test comparing the old and new packing implementations |
From my test with the AutoRound format, torch.compile only provides noticeable benefits for large models or MoE models, since compilation takes additional time. For a 7B model, packing without torch.compile typically takes only about 15 seconds, but with torch.compile it costs about 30 seconds.
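A hedged sketch of how one might gate compilation on model size, per the observation above; `maybe_compile` and the threshold are illustrative only, not part of this PR:

```python
import torch

def maybe_compile(fn, num_params: int, threshold: int = 10_000_000_000):
    # Compilation itself costs tens of seconds, so only opt in when the
    # model is large enough (the 10B threshold here is an arbitrary
    # example) for the per-call speedup to amortize the one-time cost.
    return torch.compile(fn) if num_params > threshold else fn
```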
See vllm-project/compressed-tensors#400.
Bench code: https://gist.github.com/yiliu30/5535ac154cdd000d731a9fdd385b5df8
Benchmark results