[OV CPU] add CPU pass ConvertConvolutionToMatMul into decompression_handling_manager #32983
Conversation
… matmul. Then met MarkDequantization pattern
@v-Golubev, could you please review this PR?
namespace {
std::shared_ptr<Model> create_conv_function(const Shape& input_shape,
                                            const Shape& weights_shape,
                                            bool with_dequantization) {
Minor:
Suggested change:
-    bool with_dequantization) {
+    bool with_decompression) {
As far as I see, this feature targets the decompressed weights + float activations scenario. Let's reflect that in the builder name.
changed to decompression
using namespace testing;

namespace {
std::shared_ptr<Model> create_conv_function(const Shape& input_shape,
Can we add test cases with dynamic input_shape or modify the existing ones?
Added a dynamic shape test case.
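(Editorial aside: a minimal sketch of what a dynamic-shape case could look like. The builder name, its signature, and the omission of the decompression subgraph are assumptions for illustration, not the PR's actual test code.)

```cpp
#include <memory>
#include "openvino/core/model.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convolution.hpp"
#include "openvino/op/parameter.hpp"

// Hypothetical builder: only the activations use ov::PartialShape, so a test can pass
// dynamic dimensions such as {-1, 128, 1, 1} while keeping the 1x1 weights static.
std::shared_ptr<ov::Model> create_conv_function_dyn(const ov::PartialShape& input_shape,
                                                    const ov::Shape& weights_shape) {
    auto input = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, input_shape);
    auto weights = ov::op::v0::Constant::create(ov::element::f32, weights_shape, {1.f});
    auto conv = std::make_shared<ov::op::v1::Convolution>(input,
                                                          weights,
                                                          ov::Strides{1, 1},
                                                          ov::CoordinateDiff{0, 0},
                                                          ov::CoordinateDiff{0, 0},
                                                          ov::Strides{1, 1});
    return std::make_shared<ov::Model>(ov::OutputVector{conv}, ov::ParameterVector{input});
}

// Example dynamic-shape case: dynamic sequence dimension, static 1x1 spatial dims.
// create_conv_function_dyn(ov::PartialShape{-1, 128, 1, 1}, ov::Shape{64, 128, 1, 1});
```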
auto weights_convert = std::make_shared<op::v0::Convert>(weights_const, element::f32);
auto sub_const = op::v0::Constant::create(element::i4, {1}, {1});
auto sub_convert = std::make_shared<op::v0::Convert>(sub_const, element::f32);
auto subtract = std::make_shared<op::v1::Subtract>(weights_convert, sub_convert);
The ConvertConvolutionToMatMul matcher covers more complicated cases as well (e.g. zp_reshape or weights_sub_multiply_reshape). Can we cover them with tests?
You can try to reuse ov::test::utils::initMatMulDecompressionSubgraph to avoid code duplication.
reused initMatMulDecompressionSubgraph and covered more cases
// weights should be a static 1x1 kernel, [hidden_out, hidden_in, 1, 1]
const auto& weights_shape = conv_node->get_input_partial_shape(1);
if (weights_shape.is_dynamic()) {
    return false;
}
if (weights_shape.size() != 4 || weights_shape[2] != 1 || weights_shape[3] != 1) {
    return false;
}

// input should meet: [seq_len, hidden_in, 1, 1], [1, hidden_in, 1, seq_len] or [1, hidden_in, seq_len, 1]
const auto& input_shape = conv_node->get_input_partial_shape(0);
if (input_shape.rank().get_length() != 4) {
    return false;
}
const bool is_supported_shape = (input_shape[2] == 1 && input_shape[3] == 1) ||
                                (input_shape[0] == 1 && input_shape[2] == 1) ||
                                (input_shape[0] == 1 && input_shape[3] == 1);
if (!is_supported_shape) {
    return false;
}
Shape checks can be done without custom predicate logic using the ov::pass::pattern::shape_matches predicate, which uses the symbolics feature. Can we try to reuse it here to simplify the code?
Please see the example in src/common/transformations/src/transformations/common_optimizations/fuse_rotary_positional_embeddings.cpp
thanks, reused shape_matches for the input shape.
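(Editorial aside: a rough sketch of the idea only. shape_matches is part of the symbolic pattern API named above, but the exact string syntax and headers should be verified against the fuse_rotary_positional_embeddings.cpp example; the fragment below is assumed usage inside a matcher constructor.)

```cpp
// Assumed usage: declare one of the supported 4D activation layouts declaratively
// instead of hand-written rank/dimension checks.
using namespace ov::pass::pattern;

// e.g. [seq_len, hidden_in, 1, 1]; the other supported layouts would get their own patterns
auto activations = any_input(shape_matches("[seq_len, hidden_in, 1, 1]"));
auto conv = wrap_type<ov::op::v1::Convolution>({activations, any_input()});
```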
return conv_node->get_strides() == ov::Strides{1, 1} && conv_node->get_dilations() == ov::Strides{1, 1} &&
       conv_node->get_pads_begin() == ov::CoordinateDiff{0, 0} &&
       conv_node->get_pads_end() == ov::CoordinateDiff{0, 0};
- Shouldn't we check the auto_pad value here?
- We can pass the desired attributes as a separate wrap_type parameter and avoid introducing a custom predicate. Could you please try this? As a reference, you can use the mlp3_no_bias_swiglu_block block in src/common/transformations/src/transformations/common_optimizations/fuse_moe_experts.cpp: attributes are set as the last wrap_type parameter (e.g. {{"mode", "bidirectional"}} for the Broadcast node).
Yep, added the auto_pad check and used the attribute check with wrap_type.
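(Editorial aside: a sketch of what the attribute-based wrap_type could look like. The attribute-map overload and the value encoding are assumed from the fuse_moe_experts.cpp reference, so the exact form should be verified there; `activations` and `weights` stand for previously declared pattern nodes.)

```cpp
// Assumed form: required Convolution attributes passed as the last wrap_type argument,
// replacing the custom strides/dilations/pads predicate and also pinning auto_pad.
using namespace ov::pass::pattern;

auto conv = wrap_type<ov::op::v1::Convolution>({activations, weights},
                                               {{"strides", ov::Strides{1, 1}},
                                                {"dilations", ov::Strides{1, 1}},
                                                {"pads_begin", ov::CoordinateDiff{0, 0}},
                                                {"pads_end", ov::CoordinateDiff{0, 0}},
                                                {"auto_pad", "explicit"}});
```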
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EliminateReshape);
Please keep InitNodeInfo first in the pipeline
position moved
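(Editorial aside: for clarity, the registration order after this fix would look roughly as follows; only the passes quoted in this review are shown.)

```cpp
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::InitNodeInfo);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EliminateReshape);
```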
// This must be done in order to keep compressed MatMul weights with decompression operations as is
ov::pass::Manager decompression_handling_manager("CPU:DecompressionHandling");
decompression_handling_manager.set_per_pass_validation(false);
CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::ConvertConvolutionToMatMul);
As far as I understand, this transformation converts a 1x1 Convolution with compressed weights into a Transpose->MatMul->Transpose sequence: this allows keeping the weights decompression subgraph unfolded, which is better from a memory consumption perspective.
The main concern I have is that we may introduce performance degradations for quantized models. In that case, the low precision transformations handle dequantization operations from both activations and weights, so after the low precision pipeline we usually have u8 activations + i8 weights, and the target layer can be executed in low precision. Where earlier we had just one quantized Convolution after low precision transformations, we would now have Transpose->MatMul(quantized)->Transpose, which may be worse from a performance perspective.
From my perspective, it looks like we should skip the transformation if the model is quantized. What do you think?
As we discussed offline, I added a supported_precisions parameter to the pattern to handle LPT.
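(Editorial aside: purely illustrative of the idea of restricting the matcher to decompression-only weight precisions so that LPT-quantized models keep the plain Convolution. The helper name, parameter, and usage below are assumptions, not the PR's actual code.)

```cpp
#include <algorithm>
#include <memory>
#include <vector>
#include "openvino/core/node.hpp"
#include "openvino/core/type/element_type.hpp"

// Hypothetical guard: only convert when the weights element type is in the caller-provided
// set of decompression precisions; otherwise leave the Convolution for LPT to handle.
bool weights_precision_supported(const std::shared_ptr<ov::Node>& weights,
                                 const std::vector<ov::element::Type>& supported_precisions) {
    const auto& et = weights->get_output_element_type(0);
    return std::find(supported_precisions.begin(), supported_precisions.end(), et) !=
           supported_precisions.end();
}
```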
…ulDecompressionSubgraph and cover more pattern tests; use shape_matches; use attribute match in wrap_type; add LPT supported precision argument in pattern
Details:
- ConvertConvolutionToMatMul to convert the 1x1 kernel conv to MatMul, then reuse MarkDequantization and related patterns

Tickets: