feat: Add support for max_inflight_requests parameter to prevent unbounded memory growth in ensemble models
#455
base: main
Pull Request Overview
This PR adds support for a max_ensemble_inflight_responses parameter to prevent unbounded memory growth in ensemble models by implementing backpressure control. The feature limits concurrent inflight responses from ensemble steps to downstream consumers.
- Adds backpressure configuration parameter parsing with validation
- Implements producer blocking mechanism when downstream consumers are overloaded
- Tracks inflight response counts per step with proper synchronization
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ensemble_scheduler/ensemble_scheduler.h | Adds max_inflight_responses_ field to EnsembleInfo struct |
| src/ensemble_scheduler/ensemble_scheduler.cc | Implements backpressure logic with tracking, blocking, and configuration parsing |
Co-authored-by: Copilot <[email protected]>
I have concerns here. This change creates an array of mutexes and condition variables that independently track what I assume are producer/consumer channels.
This seems overly complex to me.
Why not use a simple integer to track the active count against the capacity, and a single mutex + cv to handle interactions with those values?
Finally, does this guard against output overflows, where too many requests have completed but downstream models are incapable of consuming those outputs?
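The single-counter alternative raised in this comment could look roughly like the sketch below. It is an illustration only, not code from the PR: the class name `InflightLimiter` and its methods are hypothetical, and it assumes one limiter instance guards one producer/consumer pair.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not from the PR): one counter, one mutex, and one
// condition variable bound the number of responses in flight.
class InflightLimiter {
 public:
  // capacity must be > 0, otherwise Acquire() would block forever.
  explicit InflightLimiter(std::size_t capacity) : capacity_(capacity) {}

  // Producer side: block until a slot is free, then claim it.
  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return active_ < capacity_; });
    ++active_;
  }

  // Non-blocking variant: returns false when the limit is already reached.
  bool TryAcquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (active_ >= capacity_) return false;
    ++active_;
    return true;
  }

  // Consumer side: free a slot after a response has been consumed and
  // wake one waiting producer, if any.
  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      assert(active_ > 0 && "Release() called without a matching Acquire()");
      --active_;
    }
    cv_.notify_one();
  }

  // Current number of claimed slots.
  std::size_t Active() {
    std::lock_guard<std::mutex> lock(mu_);
    return active_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t capacity_;
  std::size_t active_ = 0;
};
```

In this sketch a producer would call `Acquire()` before emitting each response and the consumer would call `Release()` once it has finished with one. Whether a single shared counter is enough depends on whether the limit is meant to apply per step or per ensemble, which the sketch does not settle.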
Need documentation and show the use case.
Title changed: "max_ensemble_inflight_responses parameter to prevent unbounded memory growth in ensemble models" → "max_inflight_responses parameter to prevent unbounded memory growth in ensemble models"
…into spolisetty/tri-26-triton-dali-ensemble-model-memory-issue
Title changed: "max_inflight_responses parameter to prevent unbounded memory growth in ensemble models" → "max_inflight_requests parameter to prevent unbounded memory growth in ensemble models"
Co-authored-by: Yingge He <[email protected]>
This PR adds support for a max_inflight_requests parameter to prevent unbounded memory growth in ensemble models by implementing backpressure control. The feature limits concurrent in-flight responses from ensemble steps to downstream consumers.
Problem
When a fast decoupled producer (e.g., a DALI video decoder generating 200 frames almost instantly) feeds a slow consumer (e.g., an image classifier taking 200 ms per frame), responses pile up in memory while they wait to be processed. This causes unbounded memory growth (25-35 GB observed for a single request).
Solution
The new parameter blocks the producer once the downstream consumer's pending responses reach the configured limit, which applies backpressure. Example configuration:
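Independently of the configuration syntax, the effect of such a limit can be illustrated with a toy, single-threaded C++ model. This is a sketch under stated assumptions, not the scheduler's implementation: the function `PeakPending` and its tick-based model are hypothetical, and a `limit` of 0 is used here to mean "no cap".

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical toy model (not from the PR). Each tick the producer tries to
// emit up to `produce_per_tick` responses, then the consumer drains up to
// `consume_per_tick`. With limit > 0 the producer may only emit while fewer
// than `limit` responses are pending. Returns the peak pending count.
std::size_t PeakPending(std::size_t total, std::size_t produce_per_tick,
                        std::size_t consume_per_tick, std::size_t limit) {
  // Degenerate rates would never finish; treat them as "nothing to model".
  if (produce_per_tick == 0 || consume_per_tick == 0) return 0;

  std::size_t produced = 0;
  std::size_t pending = 0;
  std::size_t peak = 0;
  while (produced < total || pending > 0) {
    // Producer step: emit as much as the rate and the cap allow.
    std::size_t want = std::min(produce_per_tick, total - produced);
    if (limit > 0) {
      const std::size_t room = limit > pending ? limit - pending : 0;
      want = std::min(want, room);
    }
    produced += want;
    pending += want;
    peak = std::max(peak, pending);

    // Consumer step: drain at the slower rate.
    pending -= std::min(pending, consume_per_tick);
  }
  return peak;
}
```

With numbers loosely modeled on the scenario above (200 responses produced in one tick, one consumed per tick), `PeakPending(200, 200, 1, 0)` returns 200, while `PeakPending(200, 200, 1, 16)` returns 16: the backlog is capped at the limit instead of growing with the size of the request.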
CI: triton-inference-server/server#8458