base: v0.11.0-dev
[Refactor] Remove redundant attention operator branches. #4455
Conversation
Signed-off-by: weijinqian_v1 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention operator branches by unifying prefill logic into a single _forward_prefill method and removing support for 310P devices. The changes simplify the control flow in the main forward method, making the code cleaner and easier to maintain. My review focuses on the correctness and clarity of this refactoring. I've identified one high-severity issue related to an unused parameter in the new _forward_prefill method, which should be addressed to improve code quality.
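For context, here is a small, self-contained sketch (not taken from the PR; the state names and method bodies are illustrative assumptions) of the dispatch pattern this refactor moves toward: decode keeps its own path, while every prefill-style state funnels into one helper instead of separate per-state or per-device branches.

from enum import Enum, auto

class AscendAttentionState(Enum):
    # Illustrative members; the real enum in the PR may differ.
    PrefillNoCache = auto()
    PrefillCacheHit = auto()
    ChunkedPrefill = auto()
    DecodeOnly = auto()

class ToyAttentionImpl:
    # Toy stand-in for the backend's forward dispatch, not the PR's code.
    def forward(self, state: AscendAttentionState) -> str:
        if state == AscendAttentionState.DecodeOnly:
            return self._forward_decode()
        # All prefill-style states share a single code path.
        return self._forward_prefill(state)

    def _forward_decode(self) -> str:
        return "decode path"

    def _forward_prefill(self, state: AscendAttentionState) -> str:
        return f"unified prefill path for {state.name}"

print(ToyAttentionImpl().forward(AscendAttentionState.ChunkedPrefill))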
def _forward_prefill(self,
                     query: torch.Tensor,
                     key: torch.Tensor,
                     value: torch.Tensor,
                     kv_cache: Tuple[torch.Tensor],
                     attn_metadata: AscendMetadata,
                     output: torch.Tensor,
                     num_tokens=0):
The output parameter of the _forward_prefill method is unused: the tensor passed in is immediately shadowed by the assignment output, _ = torch_npu.npu_fused_infer_attention_score(...) on line 357. This is misleading, because it suggests an in-place write that never happens, and the caller's pre-allocated tensor is wasted.
To avoid confusion and keep the code clean, remove the output parameter from the function signature and update the call site at line 592 so it no longer passes this argument.
def _forward_prefill(self,
                     query: torch.Tensor,
                     key: torch.Tensor,
                     value: torch.Tensor,
                     kv_cache: Tuple[torch.Tensor],
                     attn_metadata: AscendMetadata,
                     num_tokens=0):
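To make the shadowing point concrete, here is a small, self-contained example (not from the PR; the function names are made up) showing why a buffer argument that is immediately rebound is never written to, and how returning the result and copying explicitly behaves instead:

import torch

def compute_shadowed(x, output):
    # Mirrors the issue: the passed-in buffer is never touched, because the
    # local name `output` is rebound to a brand-new tensor.
    output = x * 2          # rebinding, NOT an in-place write
    return output

def compute_returned(x):
    # Cleaner: no output parameter; the caller decides where the result lives.
    return x * 2

x = torch.ones(4)
buf = torch.zeros(4)

compute_shadowed(x, buf)
print(buf)                       # still zeros: the pre-allocated buffer was wasted

buf.copy_(compute_returned(x))   # explicit copy only if a buffer is really needed
print(buf)                       # now holds the result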
Collapsed diff of the modified attention backend file (Apache License 2.0 header: Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.; license text at http://www.apache.org/licenses/LICENSE-2.0; this file is a part of the vllm-ascend project). Top-level structure:
- imports: dataclasses, torch, AscendCommonAttentionMetadata (vllm_ascend.attention.utils), weak_ref_tensors (..utils)
- class AscendAttentionBackend(AttentionBackend)
- class AscendAttentionState(Enum)
- @dataclass class AscendAttentionMetadataBuilder
- class AscendAttentionBackendImpl(AttentionImpl)
- def unified_ascend_attention_with_output(...)
- def unified_attention_with_output_fake(...)
- direct_register_custom_op(...)
[Refactor] Remove redundant attention operator branches.
Reason: