[Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. #5320
base: develop
Conversation
Thanks for your contribution!
Codecov Report

@@           Coverage Diff            @@
##           develop    #5320   +/-   ##
==========================================
  Coverage         ?   59.64%
==========================================
  Files            ?      325
  Lines            ?    40364
  Branches         ?     6110
==========================================
  Hits             ?    24076
  Misses           ?    14399
  Partials         ?     1889
Force-pushed from a7380f6 to 7480fb7.
Jiang-Jia-Jun left a comment:
This abort only appears to terminate requests that are still in the in-memory queue; requests that are already generating are not handled.
fastdeploy/engine/common_engine.py (outdated)
)
self.token_processor._recycle_resources(
    req_id, batch_id, abort_task, abort_res, is_prefill, True
)
The asynchronous nature of the engine needs to be taken into account:
- the resource_manager stops scheduling the request
- confirm that the engine slot serving the request has finished generating all of its in-flight tokens
- the resource manager then reclaims the resources allocated to the request
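A minimal sketch of the three-step flow the reviewer describes, using hypothetical method names (`stop_scheduling`, `slot_finished`, `recycle`) that are not taken from the FastDeploy code base:

```python
import asyncio


async def abort_request(resource_manager, engine, req_id: str) -> None:
    """Abort a request without racing the engine's in-flight generation (sketch)."""
    # 1. Stop scheduling: no new decode steps are launched for this request.
    resource_manager.stop_scheduling(req_id)

    # 2. Wait until the engine slot serving this request has finished
    #    emitting the tokens that were already in flight.
    while not engine.slot_finished(req_id):
        await asyncio.sleep(0.01)

    # 3. Only now reclaim the resources allocated to the request.
    resource_manager.recycle(req_id)
```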
Motivation
In the current code structure, when a client disconnects from the online inference service, the inference running for that request and the resources it occupies in FastDeploy are not terminated or released. Logic therefore needs to be added to detect the disconnection, stop the inference, and release the associated resources.
Modifications
API Layer Unification

- In the `utils` module, add or enhance the `with_cancellation` decorator to uniformly handle the cancellation logic for all requests (e.g., listening for HTTP disconnection).
- Apply the `@with_cancellation` decorator uniformly to all external interfaces in the API layer (e.g., `chat`, `completions`, `score`, etc.).
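As a rough illustration, the decorator below is a minimal sketch of this idea for FastAPI endpoints; it assumes the wrapped handler receives the HTTP request as a `raw_request` keyword argument, and it is not the actual implementation in the `fastdeploy` `utils` module.

```python
import asyncio
import functools

from fastapi import Request


def with_cancellation(handler):
    """Cancel the wrapped endpoint as soon as the HTTP client disconnects (sketch)."""

    @functools.wraps(handler)
    async def wrapper(*args, raw_request: Request, **kwargs):
        handler_task = asyncio.create_task(handler(*args, raw_request=raw_request, **kwargs))

        async def watch_for_disconnect():
            # Poll the connection; return once the client has gone away.
            while not await raw_request.is_disconnected():
                await asyncio.sleep(0.5)

        watcher_task = asyncio.create_task(watch_for_disconnect())
        done, _ = await asyncio.wait(
            {handler_task, watcher_task}, return_when=asyncio.FIRST_COMPLETED
        )

        if handler_task in done:
            # Normal completion: stop watching and return the response.
            watcher_task.cancel()
            return handler_task.result()

        # The client disconnected first: cancel the in-flight handler so an
        # asyncio.CancelledError propagates to the serving layer.
        handler_task.cancel()
        raise asyncio.CancelledError("client disconnected")

    return wrapper
```

Applied as `@with_cancellation` on the `chat`, `completions`, or `score` handlers, this turns a client disconnect into a cancellation of the in-flight request coroutine.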
Exception Handling and Notification

- Catch the `asyncio.CancelledError` exception thrown by the `@with_cancellation` decorator and notify the engine that the request has been cancelled.
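A sketch of where that exception could be caught and turned into an abort notification for the engine; the `zmq_client` object, the JSON message format, and the handler signature here are illustrative assumptions, not the actual FastDeploy API.

```python
import asyncio
import json


async def create_chat_completion(request, raw_request, chat_serving, zmq_client):
    """Hypothetical endpoint body showing where the cancellation is handled."""
    try:
        return await chat_serving.create_chat_completion(request)
    except asyncio.CancelledError:
        # Raised via @with_cancellation when the client disconnects: tell the
        # engine to abort this request, then re-raise so the framework can
        # finish tearing down the connection.
        abort_msg = json.dumps({"cmd": "abort", "request_id": request.request_id})
        zmq_client.send(abort_msg.encode("utf-8"))
        raise
```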
Engine Layer Cancellation Handling (EngineService)

- `EngineService` modifies the receiving logic of its existing request thread so that cancellation request messages are also received via ZMQ.
- Add a new function, `abort_requests`, specifically responsible for handling the received cancellation requests: stopping inference and reclaiming resources.
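A sketch of how the engine-side receive thread could distinguish cancellation messages from ordinary requests; the ZMQ socket setup, the endpoint path, and the message schema are assumptions for illustration only.

```python
import json
import threading

import zmq


class EngineReceiverSketch:
    """Toy stand-in for the request-receiving thread in EngineService."""

    def __init__(self, endpoint: str = "ipc:///tmp/fd_engine_requests.ipc"):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.PULL)
        self.socket.bind(endpoint)
        self.running = True

    def receive_loop(self):
        while self.running:
            msg = json.loads(self.socket.recv())
            if msg.get("cmd") == "abort":
                # Cancellation message: stop inference and reclaim resources.
                self.abort_requests([msg["request_id"]])
            else:
                # Ordinary inference request: hand it to the normal path.
                self.add_requests(msg)

    def abort_requests(self, request_ids):
        raise NotImplementedError  # see the resource-cleanup sketch below

    def add_requests(self, msg):
        raise NotImplementedError

    def start(self):
        threading.Thread(target=self.receive_loop, daemon=True).start()
```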
Resource Cleanup and Scheduling (`abort_requests` Details)

- In the `abort_requests` function, the following key steps are executed to ensure the request is fully terminated and its resources are recovered:
  - Construct a finished `RequestOutput` result and place it into the scheduler's output queue.
  - Clean the corresponding entries out of the `tasks_list` and `stop_flags` structures.
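A sketch of what those cleanup steps might look like; `tasks_list`, `stop_flags`, `put_results`, and `recycle_resources` are assumed names based on the description above and the reviewed diff, not the actual FastDeploy signatures, and the `RequestOutput` fields are stand-ins.

```python
from dataclasses import dataclass


@dataclass
class AbortedRequestOutput:
    """Stand-in for FastDeploy's RequestOutput; the real fields will differ."""

    request_id: str
    finished: bool = True
    error_msg: str = "request aborted after client disconnect"


def abort_requests(request_ids, tasks_list, stop_flags, scheduler, resource_manager):
    """Hypothetical cleanup for cancelled requests (names are illustrative)."""
    for req_id in request_ids:
        for batch_id, task in enumerate(tasks_list):
            if task is None or task.request_id != req_id:
                continue

            # 1. Stop the slot: mark it stopped and drop the task so the
            #    executor schedules no further decode steps for it.
            stop_flags[batch_id] = True
            tasks_list[batch_id] = None

            # 2. Emit a finished result so the client-facing side receives a
            #    terminal output for the aborted request.
            scheduler.put_results([AbortedRequestOutput(request_id=req_id)])

            # 3. Reclaim the resources (KV cache blocks, slot) that the
            #    request had been allocated.
            resource_manager.recycle_resources(req_id, batch_id)
```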
Usage or Command

Accuracy Tests
Checklist
- Add at least one tag from the following list to the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.