
Conversation

@bopeng1234 (Contributor)

Details:

  • Enables models that use a 1x1 kernel convolution with a dequantize (weights decompression) pattern.
  • Adds the ConvertConvolutionToMatMul pass to convert the 1x1 kernel conv to MatMul, so that MarkDequantization and the related patterns can be reused (see the sketch below).
  • Otherwise, the int4 weights would be constant-folded to fp32.
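For context, here is a minimal sketch of the equivalence the new pass relies on (the helper name and concrete shapes are assumptions for illustration, not the pass implementation): a 1x1 Convolution over [1, C_in, 1, S] activations computes the same result as a MatMul over the channel axis, so the weights can stay 2D behind their int4 decompression subgraph instead of being folded.

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/matmul.hpp"
#include "openvino/op/parameter.hpp"
#include "openvino/op/transpose.hpp"

// Hypothetical helper, for illustration only.
std::shared_ptr<ov::Model> conv1x1_as_matmul(int64_t c_in, int64_t c_out) {
    using namespace ov;
    // Activations [1, C_in, 1, S] with a dynamic sequence length S.
    auto act = std::make_shared<op::v0::Parameter>(
        element::f32, PartialShape{1, c_in, 1, Dimension::dynamic()});
    // Move channels to the last axis: [1, C_in, 1, S] -> [1, 1, S, C_in].
    auto perm_in = op::v0::Constant::create(element::i64, {4}, {0, 2, 3, 1});
    auto transposed = std::make_shared<op::v1::Transpose>(act, perm_in);
    // 2D weights [C_out, C_in]; in the real model this Constant sits behind an
    // int4 decompression subgraph (Convert -> Subtract -> Multiply).
    auto weights = op::v0::Constant::create(element::f32, Shape{size_t(c_out), size_t(c_in)}, {0.f});
    // [1, 1, S, C_in] x [C_out, C_in]^T -> [1, 1, S, C_out]
    auto matmul = std::make_shared<op::v0::MatMul>(transposed, weights, false, true);
    // Restore the NCHW layout: [1, C_out, 1, S], the 1x1 Convolution's output.
    auto perm_out = op::v0::Constant::create(element::i64, {4}, {0, 3, 1, 2});
    auto out = std::make_shared<op::v1::Transpose>(matmul, perm_out);
    return std::make_shared<Model>(OutputVector{out}, ParameterVector{act});
}
```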

Tickets:

… matmul. Then meet the MarkDequantization pattern
@github-actions bot added the labels category: CPU (OpenVINO CPU plugin) and category: transformations (OpenVINO Runtime library - Transformations) on Nov 21, 2025
@bopeng1234 marked this pull request as ready for review on November 21, 2025 09:45
@bopeng1234 requested review from a team as code owners on November 21, 2025 09:45
@bopeng1234 requested review from CuriousPanCake and removed the request for a team on November 21, 2025 09:45
@maxnick requested a review from v-Golubev on November 24, 2025 13:44
@maxnick assigned maxnick and v-Golubev, then unassigned maxnick, on Nov 24, 2025
@maxnick (Contributor) commented on Nov 24, 2025

@v-Golubev , could you please review this PR?

```cpp
namespace {
std::shared_ptr<Model> create_conv_function(const Shape& input_shape,
                                            const Shape& weights_shape,
                                            bool with_dequantization) {
```
Contributor:

Minor:

Suggested change:

```diff
-                                            bool with_dequantization) {
+                                            bool with_decompression) {
```

As far as I can see, this feature targets the decompressed weights + float activations scenario. Let's reflect that in the builder name.

Contributor Author (@bopeng1234):

changed to decompression

```cpp
using namespace testing;

namespace {
std::shared_ptr<Model> create_conv_function(const Shape& input_shape,
```
Contributor:

Can we add test cases with dynamic input_shape or modify the existing ones?

Contributor Author (@bopeng1234):

added a dynamic shape test case
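Roughly what such a case looks like (a sketch assuming the builder's input parameter was switched from Shape to PartialShape; the concrete dimensions are illustrative):

```cpp
// Dynamic sequence length: the 1x1-conv pattern must still be recognized.
const auto input_shape = ov::PartialShape{1, 256, 1, ov::Dimension::dynamic()};
const auto weights_shape = ov::Shape{512, 256, 1, 1};
auto model = create_conv_function(input_shape, weights_shape,
                                  /*with_decompression=*/true);
```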

```cpp
auto weights_convert = std::make_shared<op::v0::Convert>(weights_const, element::f32);
auto sub_const = op::v0::Constant::create(element::i4, {1}, {1});
auto sub_convert = std::make_shared<op::v0::Convert>(sub_const, element::f32);
auto subtract = std::make_shared<op::v1::Subtract>(weights_convert, sub_convert);
```
Contributor:

The ConvertConvolutionToMatMul matcher covers more complicated cases as well (e.g. zp_reshape or weights_sub_multiply_reshape). Can we cover them with tests?
You can try to reuse ov::test::utils::initMatMulDecompressionSubgraph to avoid code duplication.

Contributor Author (@bopeng1234):

reused initMatMulDecompressionSubgraph and covered more cases
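For readers following along, one of the "more complicated" weight patterns looks roughly like this when built by hand (a sketch with assumed shapes; the PR reuses ov::test::utils::initMatMulDecompressionSubgraph instead, whose exact signature is not shown here):

```cpp
// zp_reshape case: i4 weights -> Convert -> Subtract(reshaped zero point) -> Multiply(scale)
auto weights = ov::op::v0::Constant::create(ov::element::i4, {64, 64}, {1});
auto w_f32 = std::make_shared<ov::op::v0::Convert>(weights, ov::element::f32);
auto zp = ov::op::v0::Constant::create(ov::element::i4, {64}, {1});
auto zp_f32 = std::make_shared<ov::op::v0::Convert>(zp, ov::element::f32);
// The zero point is reshaped to a broadcastable shape before the Subtract.
auto zp_shape = ov::op::v0::Constant::create(ov::element::i64, {2}, {64, 1});
auto zp_r = std::make_shared<ov::op::v1::Reshape>(zp_f32, zp_shape, false);
auto sub = std::make_shared<ov::op::v1::Subtract>(w_f32, zp_r);
auto scale = ov::op::v0::Constant::create(ov::element::f32, {64, 1}, {0.1f});
auto decompressed = std::make_shared<ov::op::v1::Multiply>(sub, scale);
```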

Comment on lines 47 to 66

```cpp
// weights should be a static 1x1 kernel: [hidden_out, hidden_in, 1, 1]
const auto& weights_shape = conv_node->get_input_partial_shape(1);
if (weights_shape.is_dynamic()) {
    return false;
}
if (weights_shape.size() != 4 || weights_shape[2] != 1 || weights_shape[3] != 1) {
    return false;
}

// input should meet: [seq_len, hidden_in, 1, 1], [1, hidden_in, 1, seq_len] or [1, hidden_in, seq_len, 1]
const auto& input_shape = conv_node->get_input_partial_shape(0);
if (input_shape.rank().get_length() != 4) {
    return false;
}
const bool is_supported_shape = (input_shape[2] == 1 && input_shape[3] == 1) ||
                                (input_shape[0] == 1 && input_shape[2] == 1) ||
                                (input_shape[0] == 1 && input_shape[3] == 1);
if (!is_supported_shape) {
    return false;
}
```
Contributor:

Shape checks can be done without custom predicate logic by using the ov::pass::pattern::shape_matches predicate, which relies on the symbolics feature. Can we try to reuse it here to simplify the code?
Please see the example in src/common/transformations/src/transformations/common_optimizations/fuse_rotary_positional_embeddings.cpp

Contributor Author (@bopeng1234):

thanks, reused shape_matches for the input shape.
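For reference, the resulting predicate style looks roughly like this (a sketch: the symbol names in the pattern strings and the ||-composition of predicates are assumptions based on other symbolic-pattern passes, not the PR's final code):

```cpp
using namespace ov::pass::pattern;
// The three supported layouts, expressed symbolically instead of
// hand-written dimension checks:
auto activations = any_input(shape_matches("[1, C, 1, S]") ||
                             shape_matches("[1, C, S, 1]") ||
                             shape_matches("[S, C, 1, 1]"));
auto weights = any_input(shape_matches("[O, I, 1, 1]"));
auto conv = wrap_type<ov::op::v1::Convolution>({activations, weights});
```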

Comment on lines 69 to 71

```cpp
return conv_node->get_strides() == ov::Strides{1, 1} && conv_node->get_dilations() == ov::Strides{1, 1} &&
       conv_node->get_pads_begin() == ov::CoordinateDiff{0, 0} &&
       conv_node->get_pads_end() == ov::CoordinateDiff{0, 0};
```
Contributor:

  1. Shouldn't we check the auto_pad value here?
  2. We can pass the desired attributes as a separate wrap_type parameter and avoid introducing a custom predicate. Could you please try this? As a reference, see the mlp3_no_bias_swiglu_block in src/common/transformations/src/transformations/common_optimizations/fuse_moe_experts.cpp: attributes are set as the last wrap_type parameter (e.g. {{"mode", "bidirectional"}} for the Broadcast node)

Contributor Author (@bopeng1234):

yep, added the auto_pad check, and used the attribute check with wrap_type.
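The resulting pattern then looks something like this (a sketch: the attribute-map spellings and value types are assumptions extrapolated from the fuse_moe_experts.cpp example):

```cpp
// Attributes as the last wrap_type parameter replace the custom predicate:
auto conv = ov::pass::pattern::wrap_type<ov::op::v1::Convolution>(
    {activations, weights},
    {{"strides", ov::Strides{1, 1}},
     {"dilations", ov::Strides{1, 1}},
     {"pads_begin", ov::CoordinateDiff{0, 0}},
     {"pads_end", ov::CoordinateDiff{0, 0}},
     {"auto_pad", "explicit"}});
```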

```cpp
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EliminateReshape);
```
Contributor:

Please keep InitNodeInfo first in the pipeline

Contributor Author (@bopeng1234):

moved InitNodeInfo back to the first position
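So, based on the snippet above, the registration order becomes (with InitNodeInfo restored to the front of the pipeline):

```cpp
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::InitNodeInfo);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EliminateReshape);
```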

```cpp
// This must be done in order to keep compressed MatMul weights with decompression operations as is
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
```
Contributor:

As far as I understand, this transformation converts a 1x1 Convolution with compressed weights to a Transpose->MatMul->Transpose sequence: this allows keeping the weights decompression subgraph unfolded, which is better from the memory consumption perspective.

The main concern I have is that we may introduce performance degradations for quantized models. In that case, the low precision transformations handle dequantization operations on both activations and weights, so after the low precision pipeline we usually have u8 activations + i8 weights, and the target layer can be executed in low precision. But where we previously had just one Convolution (quantized) after the low precision transformations, we now have Transpose->MatMul (quantized)->Transpose, which may be worse from the performance perspective.

From my perspective, it looks like we should avoid the transformation if the model is quantized. What do you think?

Contributor Author (@bopeng1234):

as we discussed offline, added a supported_precisions parameter to the pattern to handle LPT.
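A rough sketch of the kind of guard this enables (the parameter name comes from the comment above; the body is a hypothetical illustration, not the PR's code):

```cpp
// Hypothetical check inside the matcher callback: if the weights are stored in
// a precision that LPT itself will dequantize (i.e. the model is quantized),
// skip the conversion and leave the Convolution for the low precision pipeline.
const auto weights_type = conv_node->get_input_element_type(1);
const bool handled_by_lpt =
    std::find(supported_precisions.begin(), supported_precisions.end(), weights_type) !=
    supported_precisions.end();
if (handled_by_lpt) {
    return false;  // keep the quantized Convolution as a single node
}
```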

…ulDecompressionSubgraph and cover more pattern tests; use shape_matches; use attribute match in wrap_type; add LPT supported precision argument in pattern