
[QNN EP] Add LowPowerBlockQuantization support for Gemm node #25458


Merged (2 commits, Jul 21, 2025)

Conversation

quic-tirupath
Contributor

Description

  • Low Power Block Quantization (LPBQ) is widely used to accelerate accuracy-sensitive models via the QNN (Qualcomm Neural Network) stack.
  • The LPBQ encoding format is Qualcomm's alternative to the block-quantization technique.
  • The current implementation expects LPBQ encodings packed in a node sequence (DQ -> Q -> DQ).
  • This PR folds the LPBQ pattern on the weight of Gemm nodes into a QNN BlockExpansion encoding structure.
  • This PR adds INT4 quantization support.

Motivation and Context

  • This enables acceleration, via the QNN EP, of accuracy-sensitive models that require block-quantization-style encodings.
  • This avoids falling back to the CPU EP for nodes that consume block-quantized tensors, further improving inference time.
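The "low power" idea behind LPBQ is that per-block float scales are factored into a single per-channel float scale times small per-block integer multipliers, so only one float per channel needs to be stored. The following is a minimal illustrative sketch of that decomposition; the function name, the normalization by the maximum block scale, and the 8-bit multiplier width are assumptions for illustration, not the ORT/QNN implementation.

```python
import numpy as np

def lpbq_decompose(block_scales, int_bits=8):
    """Illustrative LPBQ scale decomposition (hypothetical helper, not the QNN EP API).

    Factors the per-block float scales of one output channel into a single
    per-channel float scale and per-block integer multipliers.
    """
    levels = (1 << int_bits) - 1
    # Assumed normalization: largest block scale maps to the top integer level.
    channel_scale = block_scales.max() / levels
    int_scales = np.clip(np.round(block_scales / channel_scale), 1, levels).astype(np.uint8)
    return channel_scale, int_scales

block_scales = np.array([0.010, 0.020, 0.005, 0.040])
ch_scale, int_scales = lpbq_decompose(block_scales)
# Reconstructed per-block scales approximate the originals:
recon = ch_scale * int_scales
```

The reconstruction `ch_scale * int_scales` only approximates the original float scales (quantization of the scales themselves), which is the storage/compute trade-off LPBQ makes.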


@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds Low Power Block Quantization (LPBQ) support for Gemm nodes in the QNN (Qualcomm Neural Network) execution provider. LPBQ is an alternative block quantization technique that enables acceleration of accuracy-sensitive models by avoiding CPU fallback for block-quantized tensors.

  • Introduces a new fusion pattern to detect DQ->Q->DQ sequences on Gemm weights and convert them to QNN's BlockExpansion encoding
  • Adds INT4 quantization support through specialized template traits and quantization functions
  • Extends the quantization parameter wrapper to handle LPBQ encodings with per-channel float scales and per-block integer scales
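The effective per-block scale described above is the product of the per-channel float scale and the per-block integer scale; "block expansion" then broadcasts each block's scale across the elements of that block. A minimal sketch of that dequantization, assuming symmetric INT4 weights stored in int8 (zero-point omitted) and hypothetical argument shapes, not the QNN EP API:

```python
import numpy as np

def lpbq_dequantize(w_q, channel_scales, int_scales, block_size):
    """Illustrative LPBQ dequantization via block expansion (hypothetical helper).

    w_q:            (out_channels, in_features) INT4 values stored as int8
    channel_scales: (out_channels,) per-channel float scales
    int_scales:     (out_channels, n_blocks) per-block integer scales (uint8)
    """
    # Effective per-block float scale = per-channel float * per-block integer scale.
    eff = channel_scales[:, None] * int_scales.astype(np.float32)  # (out, n_blocks)
    # Block expansion: repeat each block's scale across its block_size columns.
    eff_expanded = np.repeat(eff, block_size, axis=1)              # (out, in_features)
    return w_q.astype(np.float32) * eff_expanded

w_q = np.array([[1, 2, 3, 4]], dtype=np.int8)
w = lpbq_dequantize(w_q,
                    np.array([0.5], dtype=np.float32),
                    np.array([[2, 4]], dtype=np.uint8),
                    block_size=2)
# w -> [[1.0, 2.0, 6.0, 8.0]]
```

Folding the DQ -> Q -> DQ sequence into a single encoding lets the backend apply this expansion internally instead of materializing the intermediate tensors.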

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| qnn_utils.h | Adds LPBQ data quantization function and Int4 quantization traits |
| qnn_utils.cc | Implements LowPowerBlockQuantizeData function for LPBQ encoding |
| qnn_quant_params_wrapper.h | Extends wrapper to support LPBQ quantization parameters |
| qnn_quant_params_wrapper.cc | Implements LPBQ constructor and deep copy logic |
| utils.h/cc | Adds utility functions for parent/child node traversal in fusion detection |
| qnn_node_group.cc | Registers LPBQ Gemm fusion and updates fusion dispatch logic |
| lpbqgemm_fusion.h/cc | Implements the LPBQ Gemm fusion pattern detection and QNN node creation |
| qnn_model_wrapper.h/cc | Templated UnpackScales function to support both float and uint8_t scales |

@jywu-msft
Member

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@jywu-msft
Member

There are build errors in both the Linux QNN and Windows ARM64 QNN CI pipelines.

@jywu-msft jywu-msft added the ep:QNN issues related to QNN execution provider label Jul 19, 2025
 - Fixes Linux build error
 - fix documentation for a function
@HectorSVC
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows x64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

Contributor

@HectorSVC HectorSVC left a comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@HectorSVC HectorSVC merged commit 91e9118 into microsoft:main Jul 21, 2025
103 of 114 checks passed
Labels
ep:QNN issues related to QNN execution provider
3 participants