
Conversation


@anysources anysources commented Nov 26, 2025

What this PR does / why we need it?

When testing with mooncake_layerwise_connector, we found that with long input sequences and high concurrency, the overlap between layer-wise KV transfer and prefill computation is very poor: most layer-wise KV transfers are fully exposed, resulting in unsatisfactory TTFT performance.
After analysis, moving the NPU synchronization that previously sat just before the batch_transfer_sync_write call into the wait_for_layer_load function allows layer-wise KV transfer to be completely hidden behind prefill computation, which improves TTFT.
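To make the placement change concrete, here is a minimal, hypothetical sketch. Only the names batch_transfer_sync_write, wait_for_layer_load, and model_stream come from this PR; the class, method signatures, and surrounding structure are illustrative assumptions, not the actual connector code.

    # Hypothetical sketch, not the real connector implementation.
    class LayerwiseConnectorSketch:
        def __init__(self, engine, model_stream):
            self.engine = engine              # transfer engine (placeholder)
            self.model_stream = model_stream  # NPU stream running prefill compute

        def send_layer_old(self, session_id, src_list, dst_list, length_list):
            # Pre-PR placement: block on the compute stream before every
            # write, which serializes KV transfer with prefill compute.
            self.model_stream.synchronize()
            return self.engine.batch_transfer_sync_write(
                session_id, src_list, dst_list, length_list)

        def wait_for_layer_load_new(self, layer_name):
            # Post-PR placement: synchronize here instead, so the sending
            # thread can push layer i's KV while layer i+1 is still being
            # computed on the NPU.
            self.model_stream.synchronize()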

Does this PR introduce any user-facing change?

How was this patch tested?

vllm version: v0.11.0

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request aims to improve TTFT performance by repositioning an NPU synchronization operation. The changes remove synchronization calls from the KVCacheSendingLayerThread and add a new one in MooncakeLayerwiseConnectorWorker.wait_for_layer_load. While this is intended to improve the overlap between computation and data transfer, removing the synchronization before batch_transfer_sync_write introduces critical race conditions: the data for the KV cache transfer, which is produced by asynchronous operations (either model computation or an explicit copy_), may not be ready when the transfer starts, potentially leading to data corruption. I have added comments highlighting these critical issues and suggesting that the synchronization be restored.

Comment on lines 193 to 194
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)

critical

Re-introducing this synchronization block is crucial to prevent a race condition. The KV cache data at the source addresses (src_list) is the result of asynchronous NPU computations. Without synchronization, batch_transfer_sync_write could execute before the layer's computation is finished, leading to the transfer of stale or incorrect data. The synchronization added in wait_for_layer_load is not sufficient to prevent this race, as it executes in a different thread.

Suggested change
-    ret = self.engine.batch_transfer_sync_write(
-        session_id, src_list, dst_list, length_list)
+    if self.current_layer != layer_index:
+        self.current_layer = layer_index
+        self.model_stream.synchronize()
+    ret = self.engine.batch_transfer_sync_write(
+        session_id, src_list, dst_list, length_list)

Comment on lines 243 to 244
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)

critical

A synchronization is critical here to prevent a race condition. The copy_ operations on lines 201-202 are asynchronous. Without synchronize(), batch_transfer_sync_write can execute before the data is copied into self.k_buffer and self.v_buffer, causing the transfer of incomplete or garbage data. Please restore the synchronization to ensure data integrity.

            self.model_stream.synchronize()
            ret = self.engine.batch_transfer_sync_write(
                session_id, src_list, dst_list, length_list)
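For context, a self-contained illustration of the hazard described above, using CUDA streams as a stand-in for the NPU model stream (the names and structure here are illustrative assumptions, not the connector's code): a non_blocking copy_ only enqueues work, so the staging buffer must not be handed to a transfer engine until the stream has been synchronized.

    import torch

    def stage_kv(kv_src: torch.Tensor, kv_buffer: torch.Tensor,
                 stream: torch.cuda.Stream) -> None:
        # Enqueued asynchronously on the stream: copy_ returns before the
        # data has actually landed in kv_buffer.
        with torch.cuda.stream(stream):
            kv_buffer.copy_(kv_src, non_blocking=True)

    def hand_off_for_transfer(kv_buffer: torch.Tensor,
                              stream: torch.cuda.Stream) -> None:
        # Without this synchronize(), a transfer engine that reads
        # kv_buffer by address (as batch_transfer_sync_write does) could
        # pick up stale or partially written data.
        stream.synchronize()
        # ... pass kv_buffer.data_ptr() to the transfer engine here ...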
