Modify the position of npu sync ops in layerwise connector for better TTFT performance #4478
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to improve TTFT performance by repositioning an NPU synchronization operation. The changes remove synchronization calls from the KVCacheSendingLayerThread and add a new one in MooncakeLayerwiseConnectorWorker.wait_for_layer_load. While this is intended to improve the overlap between computation and data transfer, removing the synchronization before batch_transfer_sync_write introduces critical race conditions: the KV cache data to be transferred is produced by asynchronous operations (either model computation or explicit copy_ calls) and may not be ready when the transfer starts, potentially leading to data corruption. I have added comments highlighting these critical issues and suggesting how to restore the synchronization.
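To make the hazard concrete, here is a minimal sketch of the ordering problem the review describes, assuming a torch_npu-style stream API; `rdma_write` is a hypothetical stand-in for the Mooncake engine's batch_transfer_sync_write and is not part of this PR:

```python
# Sketch only: assumes torch_npu is installed on an Ascend host;
# rdma_write is a hypothetical stand-in, not a real Mooncake API call.
import torch
import torch_npu  # registers the "npu" device with PyTorch

def rdma_write(tensor: torch.Tensor) -> None:
    """Stand-in for engine.batch_transfer_sync_write: reads the tensor's
    device memory from the host side, outside the NPU stream's ordering."""

stream = torch_npu.npu.current_stream()
kv_cache = torch.zeros(1024, device="npu")
new_kv = torch.ones(1024, device="npu")

kv_cache.copy_(new_kv)  # enqueued on `stream`; returns before the copy finishes
# rdma_write(kv_cache)  # unsafe: the transfer may read the buffer mid-copy

stream.synchronize()    # block the host until all queued NPU work is done
rdma_write(kv_cache)    # safe: the buffer now holds the completed copy
```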
```python
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
Re-introducing this synchronization block is crucial to prevent a race condition. The KV cache data at the source addresses (src_list) is the result of asynchronous NPU computations. Without synchronization, batch_transfer_sync_write could execute before the layer's computation is finished, leading to the transfer of stale or incorrect data. The synchronization added in wait_for_layer_load is not sufficient to prevent this race, as it executes in a different thread.
Suggested change:

```python
if self.current_layer != layer_index:
    self.current_layer = layer_index
    self.model_stream.synchronize()
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
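Note that the current_layer guard in this suggestion limits the cost of the restored barrier: the stream is synchronized only once per layer, when the first transfer for a new layer begins, rather than before every individual transfer.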
```python
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
A synchronization is critical here to prevent a race condition. The copy_ operations on lines 201-202 are asynchronous. Without synchronize(), batch_transfer_sync_write can execute before the data is copied to self.k_buffer and self.v_buffer, causing the transfer of incomplete or garbage data. Please restore the synchronization to ensure data integrity.
```python
self.model_stream.synchronize()
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
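Here the synchronize() guarantees that everything previously enqueued on model_stream, including the asynchronous copy_ calls into self.k_buffer and self.v_buffer, has completed before the transfer reads those buffers.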
What this PR does / why we need it?
When testing with mooncake_layerwise_connector, we found that with long input sequences and high concurrency, the masking (overlap) between layer-wise KV transfer and prefill computation is very poor: most layer-wise KV transfers are exposed, resulting in unsatisfactory TTFT performance.
After analysis, we found that moving the NPU synchronization operation from just before the batch_transfer_sync_write call into the wait_for_layer_load function allows the layer-wise KV transfer to be fully masked by prefill computation, improving TTFT performance.
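In outline, the repositioning looks like the following simplified sketch; the class name and the wait_for_layer_load / batch_transfer_sync_write identifiers come from the snippets above, while _send_layer and the method bodies are illustrative:

```python
class MooncakeLayerwiseConnectorWorker:
    def wait_for_layer_load(self, layer_name: str) -> None:
        # New location (this PR): wait on the model stream here, where the
        # forward pass polls for a layer's KV cache, instead of inside the
        # sender thread, so transfers are no longer serialized behind compute.
        self.model_stream.synchronize()

    def _send_layer(self, session_id, src_list, dst_list, length_list):
        # Old location (removed by this PR): a model_stream.synchronize()
        # immediately before this call exposed every layer-wise transfer.
        return self.engine.batch_transfer_sync_write(
            session_id, src_list, dst_list, length_list)
```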
Does this PR introduce any user-facing change?
How was this patch tested?
vLLM version: v0.11.0