
Conversation


@dmatveev commented Oct 2, 2025

Details:

  • Yet another part of the bigger NPUW: Dynamic attention #32000
  • Introduces individual ATTENTION_HINTs for prefill & generate stages
  • In this PR, all the DYNAMIC hint does is cancel the SDPA unroll & the v-tensor transpose
  • When prefill & generate hints are different, the prefill-to-generate kvcache copy is in fact a permute (a new utility added)
  • When paired with NPUW: Introduce a pattern to handle SDPA #32283, the SDPA subgraph will go separate once hint is DYNAMIC
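The fourth bullet — the prefill-to-generate kvcache copy becoming a permute when the two stages use different hints — can be sketched generically. The helper name `kv_copy_order` is hypothetical, not from the PR; it only illustrates how differing sequence-axis positions turn a plain copy into an axis swap:

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper (not the PR's code): when the prefill and generate
// KV-caches place the sequence axis on different dimensions, the
// prefill-to-generate copy becomes an axis swap. This builds the 4-D
// permute order that swaps the two sequence dimensions.
static std::array<int, 4> kv_copy_order(int kv_dim_src, int kv_dim_dst) {
    std::array<int, 4> order{0, 1, 2, 3};  // identity order: a plain copy
    if (kv_dim_src != kv_dim_dst) {
        std::swap(order[kv_dim_src], order[kv_dim_dst]);
    }
    return order;
}
```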

@dmatveev requested review from a team as code owners October 2, 2025 21:21
github-actions bot added labels category: NPU (OpenVINO NPU plugin), category: NPUW (NPUW plugin) Oct 2, 2025
@dmatveev added this to the 2025.4 milestone Oct 2, 2025
uint32_t kv_dim_src,
uint32_t kv_dim_dst) {
if (kv_dim_src != kv_dim_dst) {
// new case - do a generic copy for now (in fact it is a permute)
Contributor

I think we may leave it as is for now, but isn't the first dimension a batch, and isn't it always == 1?

In that case we don't need a 4d permute. And we have already covered and optimized most (but not all) cases for the 3d permute.

Contributor Author

We don't need a generic 4d permute for sure, but I didn't want to handle the particular cases here as it'd complicate the flow at this stage.
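The reviewer's observation above can be sketched as a simple lowering check. The function name and shape types here are illustrative, not from the PR:

```cpp
#include <array>
#include <cstddef>

// Sketch of the reviewer's point: if the leading (batch) dimension is 1
// and the permute order keeps it in place, the 4-D permute degenerates
// into a 3-D permute over the remaining axes -- the case described as
// already covered and optimized.
static bool collapses_to_3d(const std::array<std::size_t, 4>& shape,
                            const std::array<int, 4>& order) {
    return shape[0] == 1 && order[0] == 0;
}
```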

}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
LOG_DEBUG("V-tensors transposed in prefill model");
m_kvcache_desc.v_tensors_transposed_pre = true;
@AsyaPronina (Contributor) commented Oct 3, 2025

This slightly changes the default behaviour: previously, if the transpose wasn't needed, we applied the SDPA unroll to the generate model but not to prefill (as transpose returned false, we didn't go on to apply the same transformations to prefill).

Now we apply the unroll to both generate and prefill, regardless of whether the transpose transformation for generate returned true or false.

Is it expected? It seems more correct now, by the way.

@dmatveev (Contributor Author) commented Oct 3, 2025

I am not sure what you refer to as the previous behavior.

SDPA unroll is part of the optimize_value_tensors routine. If the transpose wasn't needed (I'd rather call it "cancelled", as it is "needed" by default), neither transformation was called. Where did we apply the SDPA unroll to the generate model in this case? Am I missing something?

When it was actually "needed", we applied the transformation first to the kvcache model and, IF IT WAS applied, required it (via an assert) to be applied to the prefill model as well.

Or do you refer to the fact that optimize_value_tensors could unroll the SDPA in one model but return false because it couldn't transpose the v-tensor? That behavior was more an inconsistency than something we should keep at all costs. It is still the case, btw: we can unroll the SDPA for no actual reason (so v-tensors won't be transposed).

Probably a way to work around this would be to clone the model within the pass and then return either the original model if the transpose pass has failed, or the transformed one if it was actually applied. That's probably a thing -
EISW-186959. Thanks!
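The clone-then-commit idea from the comment above, generalized to a plain template (this is an illustrative sketch, not OpenVINO API): run the transformation on a copy of the state and adopt the copy only if the whole chain succeeded, so a partially applied rewrite (e.g. SDPA unrolled but v-tensors left untransposed) never leaks out.

```cpp
#include <functional>
#include <utility>

// Generic clone-then-commit (hypothetical helper): apply `transform` to a
// copy of `state`; keep the copy only if the transform reports full success.
template <typename T>
bool apply_if_complete(T& state, const std::function<bool(T&)>& transform) {
    T copy = state;                 // work on a clone of the model/state
    if (!transform(copy)) {
        return false;               // failure: the original stays untouched
    }
    state = std::move(copy);        // success: commit the transformed clone
    return true;
}
```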

void ov::npuw::util::permute_i4d(const ov::SoPtr<ov::ITensor>& src,
ov::SoPtr<ov::ITensor>& dst,
const std::array<int, 4> order) {
const auto& src_shape = src->get_shape();
Contributor

Do we need is_continuous() checks somewhere?

Contributor Author

This permute respects strides, so we don't.
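The point of the reply can be shown with a tiny sketch (illustrative helper, not the PR's code): addressing every element through explicit byte strides is correct for both dense buffers and strided views, so the permute never has to assume a contiguous layout.

```cpp
#include <array>
#include <cstddef>

// Compute the address of element (i,j,k,l) from explicit byte strides.
// Because the strides come from the tensor itself, a non-contiguous view
// is read correctly without any is_continuous() precondition.
static const char* elem_ptr(const char* base,
                            const std::array<std::size_t, 4>& idx,
                            const std::array<std::size_t, 4>& strides /* bytes */) {
    return base + idx[0] * strides[0] + idx[1] * strides[1]
                + idx[2] * strides[2] + idx[3] * strides[3];
}
```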


};
const auto dst_o =
v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] + v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);
Contributor Author

So based on my paper math, this approach looks correct.
This permute assumes that two dimensions swap places (order): given a position in the source tensor (i/j/k/l), we find the right position in the dst tensor with the remapped vector (i/j/l/k in this case). As the axis order defines the physical layout, we use the src/dst strides in the same order as given. Finally, we copy elem_size bytes for a single element with copy_n.
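The explanation above can be condensed into a self-contained strided permute. This is an illustrative sketch, not the PR's actual permute_i4d (names, raw pointers, and byte strides are assumptions); destination index d takes source index order[d], matching the remapped-vector description:

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Sketch: permute a 4-D byte buffer given an axis order, computing element
// offsets from explicit byte strides on both sides so non-contiguous
// (strided) tensors are handled correctly.
static void permute_4d(const char* src, char* dst,
                       const std::array<std::size_t, 4>& src_shape,
                       const std::array<std::size_t, 4>& src_strides,  // bytes
                       const std::array<std::size_t, 4>& dst_strides,  // bytes
                       const std::array<int, 4>& order,
                       std::size_t elem_size) {
    std::array<std::size_t, 4> idx{};  // position in source coordinates
    for (idx[0] = 0; idx[0] < src_shape[0]; ++idx[0])
        for (idx[1] = 0; idx[1] < src_shape[1]; ++idx[1])
            for (idx[2] = 0; idx[2] < src_shape[2]; ++idx[2])
                for (idx[3] = 0; idx[3] < src_shape[3]; ++idx[3]) {
                    std::size_t src_o = 0, dst_o = 0;
                    for (int d = 0; d < 4; ++d) {
                        // dst index d is the source index remapped via order
                        src_o += idx[d] * src_strides[d];
                        dst_o += idx[order[d]] * dst_strides[d];
                    }
                    std::memcpy(dst + dst_o, src + src_o, elem_size);
                }
}
```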

@dmatveev self-assigned this Oct 3, 2025
@dmatveev added this pull request to the merge queue Oct 3, 2025
Merged via the queue into openvinotoolkit:master with commit 0b1651a Oct 4, 2025
238 of 244 checks passed
@dmatveev deleted the dm/hints_attn branch October 4, 2025 00:01
github-merge-queue bot pushed a commit that referenced this pull request Oct 9, 2025
### Details:
- This PR is currently decomposed into the smaller dependency parts
(merged):
   - #32268
   - #32270
   - #32283
   - #32284

### Tickets:
 - *EISW-182648*

---------

Signed-off-by: intelgaoxiong <[email protected]>
Co-authored-by: intelgaoxiong <[email protected]>
Co-authored-by: Alexey Smirnov <[email protected]>