**docs/user_guide/decoupled_models.md** (1 addition, 1 deletion)

````diff
@@ -97,7 +97,7 @@ each time with a new response. You can take a look at [grpc_server.cc](https://g
 ### Using Decoupled Models in Ensembles

-When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them. To address this, Triton provides the `max_inflight_responses` configuration field, which limits the number of concurrent inflight responses at each ensemble step.
+When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them, because these responses are buffered until the downstream models are ready to process them. To address this, Triton provides the `max_inflight_requests` configuration field, which limits the number of concurrent inflight requests at each ensemble step.

 For more details and examples, see [Managing Memory Usage in Ensembles with Decoupled Models](ensemble_models.md#managing-memory-usage-in-ensembles-with-decoupled-models).
````
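The buffering behavior described above is essentially producer/consumer backpressure through a bounded buffer. As a rough analogy (plain Python with the standard library, not Triton code; `LIMIT` stands in for the configured per-step limit), a bounded queue blocks a fast producer once the buffer is full:

```python
# Conceptual sketch only: the effect of the per-step limit is analogous to a
# bounded queue sitting between a fast upstream model and a slow downstream one.
import queue
import threading

LIMIT = 4  # stands in for the configured inflight limit
buffer = queue.Queue(maxsize=LIMIT)  # put() blocks once LIMIT items are pending

consumed = []

def fast_producer(n):
    for i in range(n):
        buffer.put(i)    # backpressure: blocks when the buffer holds LIMIT items
    buffer.put(None)     # sentinel: no more responses

def slow_consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        consumed.append(item)

t = threading.Thread(target=slow_consumer)
t.start()
fast_producer(10)
t.join()
```

With an unbounded queue (`maxsize=0`), the fast producer would run ahead and buffer every item at once, which is exactly the memory-growth failure mode the limit prevents.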
**docs/user_guide/ensemble_models.md** (9 additions, 9 deletions)

````diff
@@ -185,20 +185,20 @@ to find the optimal model configurations.

 ## Managing Memory Usage in Ensembles with Decoupled Models

-An *inflight response* is an intermediate output generated by an upstream model and held in memory until it is consumed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight responses accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).
+An *inflight request* is an intermediate request generated by an upstream model and held in memory until it is processed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight requests accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).

 Consider an example ensemble model with two steps:
 Here, the DALI model produces responses 10× faster than the ONNX model can process them. Without backpressure, these intermediate tensors accumulate in memory, eventually leading to out-of-memory errors.

-The `max_inflight_responses` field in the ensemble configuration limits the number of concurrent inflight responses between ensemble steps per request.
+The `max_inflight_requests` field in the ensemble configuration limits the number of concurrent inflight requests between ensemble steps per request.
 When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth.

 ```
 ensemble_scheduling {
-  max_inflight_responses: 16
+  max_inflight_requests: 16
   step [
     {
@@ -218,30 +218,30 @@ ensemble_scheduling {
 ```

 **Configuration:**
-* **`max_inflight_responses: 16`**: For each ensemble request (not globally), at most 16 responses from `dali_preprocess`
+* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
   can wait for `onnx_inference` to process. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.

-Use `max_inflight_responses` when your ensemble includes:
+Use `max_inflight_requests` when your ensemble includes:

 * **Decoupled models** that produce multiple responses per request
 * **Speed mismatch**: Upstream models significantly faster than downstream models
 * **Memory constraints**: Limited GPU/CPU memory available

 ### Choosing the Right Value

-The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.:
+The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.

 * **Too low** (e.g., 1-2): The producer step is frequently blocked, underutilizing faster models
 * **Too high** (e.g., 1000+): Memory usage increases, reducing the effectiveness of backpressure
 * **Recommended**: Start with a small value and tune based on memory usage and throughput monitoring

 ### Performance Considerations

-* **Zero overhead when disabled**: If `max_inflight_responses: 0` (default),
+* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default),
   no synchronization overhead is incurred.
-* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight response limit is reached and resumed ("woken up") as downstream models consume responses. This synchronization ensures memory usage stays within bounds, though it may increase latency.
+* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight requests limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization ensures memory usage stays within bounds, though it may increase latency.
````
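The blocking/wakeup mechanism described in the last bullet can be sketched with a condition variable. This is an illustrative sketch only, not Triton's actual implementation; `InflightLimiter` and the thread roles are hypothetical names chosen for the example:

```python
# Hypothetical sketch of a per-step inflight counter with a blocking/wakeup
# discipline (not Triton internals). The upstream side blocks at the limit;
# the downstream side wakes it as items finish processing.
import threading
import time

class InflightLimiter:
    def __init__(self, limit):
        self.limit = limit
        self.inflight = 0
        self.cond = threading.Condition()

    def acquire(self):
        # Upstream side: pause ("block") while the inflight limit is reached.
        with self.cond:
            while self.inflight >= self.limit:
                self.cond.wait()
            self.inflight += 1

    def release(self):
        # Downstream side: wake ("notify") a blocked upstream producer.
        with self.cond:
            self.inflight -= 1
            self.cond.notify()

limiter = InflightLimiter(limit=2)
observed = []

def upstream(n):
    for _ in range(n):
        limiter.acquire()                 # blocks once 2 items are inflight
        observed.append(limiter.inflight)

def downstream(n):
    for _ in range(n):
        time.sleep(0.01)                  # slower downstream step
        limiter.release()                 # frees one slot, wakes the producer

p = threading.Thread(target=upstream, args=(6,))
c = threading.Thread(target=downstream, args=(6,))
p.start(); c.start()
p.join(); c.join()
```

The invariant is that the inflight count never exceeds the limit, bounding memory; the cost is one lock acquisition and a possible wait per item, which is the "minimal overhead" the bullet refers to.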