[QNN EP] Add LowPowerBlockQuantization support for Gemm node #25458
Conversation
- Low Power Block Quantization (LPBQ) is widely used to accelerate accuracy-sensitive models via the QNN (Qualcomm Neural Network) stack.
- The LPBQ encoding format is Qualcomm's alternative to the block quantization technique.
- The current implementation expects LPBQ encodings packed in a node sequence (DQ -> Q -> DQ).
- This PR folds the LPBQ pattern on the weight of Gemm nodes into a QNN BlockExpansion encoding structure.
- This PR adds INT4 quantization support.
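As a hedged sketch of the general idea (not this PR's actual implementation), LPBQ can be viewed as factoring each block's float scale into a shared per-channel float scale and a small per-block integer multiplier, so only one float per channel plus low-bit integers per block need to be stored. The struct and function names below, and the 4-bit scale width, are illustrative assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical LPBQ scale decomposition: each per-block float scale is
// re-expressed as (per-channel float scale) * (per-block integer scale).
struct LpbqScales {
  float channel_scale;                // shared float scale for the channel
  std::vector<uint8_t> block_scales;  // small integer multiplier per block
};

LpbqScales DecomposeBlockScales(const std::vector<float>& block_scales,
                                int scale_bits = 4) {
  const int levels = (1 << scale_bits) - 1;  // e.g. 15 for 4-bit scales
  const float max_scale =
      *std::max_element(block_scales.begin(), block_scales.end());
  LpbqScales out;
  out.channel_scale = max_scale / static_cast<float>(levels);
  out.block_scales.reserve(block_scales.size());
  for (float s : block_scales) {
    // Quantize each block scale to an integer multiple of the channel scale.
    int q = static_cast<int>(std::lround(s / out.channel_scale));
    out.block_scales.push_back(static_cast<uint8_t>(std::clamp(q, 1, levels)));
  }
  return out;
}
```

The effective dequantization scale for a block is then `channel_scale * block_scales[i]`, which is what makes the encoding cheap to store and expand on device.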
Pull Request Overview
This PR adds Low Power Block Quantization (LPBQ) support for Gemm nodes in the QNN (Qualcomm Neural Network) execution provider. LPBQ is an alternative block quantization technique that enables acceleration of accuracy-sensitive models by avoiding CPU fallback for block-quantized tensors.
- Introduces a new fusion pattern to detect DQ->Q->DQ sequences on Gemm weights and convert them to QNN's BlockExpansion encoding
- Adds INT4 quantization support through specialized template traits and quantization functions
- Extends the quantization parameter wrapper to handle LPBQ encodings with per-channel float scales and per-block integer scales
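For intuition on the INT4 support mentioned above, here is a hedged sketch of symmetric INT4 quantization with two signed nibbles packed per byte. This illustrates the general technique only; the PR's actual template traits and packing order are not shown here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative INT4 quantization: values are quantized symmetrically to
// [-8, 7] and two signed nibbles are packed per byte (low nibble first --
// an assumption, not necessarily the layout used by the PR).
std::vector<uint8_t> QuantizeToPackedInt4(const std::vector<float>& data,
                                          float scale) {
  std::vector<uint8_t> packed((data.size() + 1) / 2, 0);
  for (size_t i = 0; i < data.size(); ++i) {
    int q = static_cast<int>(std::lround(data[i] / scale));
    q = std::clamp(q, -8, 7);  // signed 4-bit range
    uint8_t nibble = static_cast<uint8_t>(q & 0x0F);
    if (i % 2 == 0) {
      packed[i / 2] = nibble;        // low nibble
    } else {
      packed[i / 2] |= nibble << 4;  // high nibble
    }
  }
  return packed;
}
```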
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
File | Description
---|---
qnn_utils.h | Adds LPBQ data quantization function and Int4 quantization traits |
qnn_utils.cc | Implements LowPowerBlockQuantizeData function for LPBQ encoding |
qnn_quant_params_wrapper.h | Extends wrapper to support LPBQ quantization parameters |
qnn_quant_params_wrapper.cc | Implements LPBQ constructor and deep copy logic |
utils.h/cc | Adds utility functions for parent/child node traversal in fusion detection |
qnn_node_group.cc | Registers LPBQ Gemm fusion and updates fusion dispatch logic |
lpbqgemm_fusion.h/cc | Implements the LPBQ Gemm fusion pattern detection and QNN node creation |
qnn_model_wrapper.h/cc | Templated UnpackScales function to support both float and uint8_t scales |
onnxruntime/core/providers/qnn/builder/qnn_node_group/lpbqgemm_fusion.cc
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
There are build errors in both the Linux QNN and Windows ARM64 QNN CI pipelines.
- Fixes Linux build error
- Fixes documentation for a function
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).