Modify the position of npu sync ops in layerwise connector for better TTFT performance #4478
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to improve TTFT performance by repositioning an NPU synchronization operation. The changes remove synchronization calls from the KVCacheSendingLayerThread and add a new one in MooncakeLayerwiseConnectorWorker.wait_for_layer_load. While this is intended to improve the overlap between computation and data transfer, removing the synchronization before batch_transfer_sync_write introduces critical race conditions: the KV cache data to be transferred is produced by asynchronous operations (either model computation or explicit copy_ calls) and may not be ready when the transfer starts, potentially leading to data corruption. I have added comments highlighting these critical issues and suggesting how to restore the synchronization.
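To make the hazard concrete, here is a minimal sketch of the ordering problem the review describes, assuming a torch_npu-style stream API; `rdma_write` is a hypothetical stand-in for the Mooncake engine's batch_transfer_sync_write and is not part of this PR:

```python
# Sketch only: assumes torch_npu is installed on an Ascend host;
# rdma_write is a hypothetical stand-in, not a real Mooncake API call.
import torch
import torch_npu  # registers the "npu" device with PyTorch

def rdma_write(tensor: torch.Tensor) -> None:
    """Stand-in for engine.batch_transfer_sync_write: reads the tensor's
    device memory from the host side, outside the NPU stream's ordering."""

stream = torch_npu.npu.current_stream()
kv_cache = torch.zeros(1024, device="npu")
new_kv = torch.ones(1024, device="npu")

kv_cache.copy_(new_kv)  # enqueued on `stream`; returns before the copy finishes
# rdma_write(kv_cache)  # unsafe: the transfer may read the buffer mid-copy

stream.synchronize()    # block the host until all queued NPU work is done
rdma_write(kv_cache)    # safe: the buffer now holds the completed copy
```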
```python
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
Re-introducing this synchronization block is crucial to prevent a race condition. The KV cache data at the source addresses (src_list) is the result of asynchronous NPU computations. Without synchronization, batch_transfer_sync_write could execute before the layer's computation is finished, leading to the transfer of stale or incorrect data. The synchronization added in wait_for_layer_load is not sufficient to prevent this race, as it executes in a different thread.
Suggested change:

```python
if self.current_layer != layer_index:
    self.current_layer = layer_index
    self.model_stream.synchronize()
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
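Note that the current_layer guard in this suggestion limits the cost of the restored barrier: the stream is synchronized only once per layer, when the first transfer for a new layer begins, rather than before every individual transfer.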
```python
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
A synchronization is critical here to prevent a race condition. The copy_ operations on lines 201-202 are asynchronous. Without synchronize(), batch_transfer_sync_write can execute before the data is copied to self.k_buffer and self.v_buffer, causing the transfer of incomplete or garbage data. Please restore the synchronization to ensure data integrity.
```python
self.model_stream.synchronize()
ret = self.engine.batch_transfer_sync_write(
    session_id, src_list, dst_list, length_list)
```
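Here the synchronize() guarantees that everything previously enqueued on model_stream, including the asynchronous copy_ calls into self.k_buffer and self.v_buffer, has completed before the transfer reads those buffers.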
What this PR does / why we need it?
When testing with mooncake_layerwise_connector, we found that with long input sequences and high concurrency, the masking (overlap) between layer-wise KV transfer and prefill computation is very poor: most layer-wise KV transfers are exposed, resulting in unsatisfactory TTFT performance.
After analysis, we found that moving the NPU synchronization operation from just before the batch_transfer_sync_write call into the wait_for_layer_load function allows the layer-wise KV transfer to be fully masked by prefill computation, improving TTFT performance.
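In outline, the repositioning looks like the following simplified sketch; the class name and the wait_for_layer_load / batch_transfer_sync_write identifiers come from the snippets above, while _send_layer and the method bodies are illustrative:

```python
class MooncakeLayerwiseConnectorWorker:
    def wait_for_layer_load(self, layer_name: str) -> None:
        # New location (this PR): wait on the model stream here, where the
        # forward pass polls for a layer's KV cache, instead of inside the
        # sender thread, so transfers are no longer serialized behind compute.
        self.model_stream.synchronize()

    def _send_layer(self, session_id, src_list, dst_list, length_list):
        # Old location (removed by this PR): a model_stream.synchronize()
        # immediately before this call exposed every layer-wise transfer.
        return self.engine.batch_transfer_sync_write(
            session_id, src_list, dst_list, length_list)
```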
Does this PR introduce any user-facing change?
How was this patch tested?
vLLM version: v0.11.0