Commit 6612bb3

committed
Update tests and docs
1 parent ce95e2f commit 6612bb3

5 files changed: +53 -37 lines changed

docs/user_guide/decoupled_models.md

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ each time with a new response. You can take a look at [grpc_server.cc](https://g
 
 ### Using Decoupled Models in Ensembles
 
-When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them. To address this, Triton provides the `max_inflight_responses` configuration field, which limits the number of concurrent inflight responses at each ensemble step.
+When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them, because these responses are buffered until the downstream models are ready to process them. To address this, Triton provides the `max_inflight_requests` configuration field, which limits the number of concurrent inflight requests at each ensemble step.
 
 For more details and examples, see [Managing Memory Usage in Ensembles with Decoupled Models](ensemble_models.md#managing-memory-usage-in-ensembles-with-decoupled-models).

docs/user_guide/ensemble_models.md

Lines changed: 9 additions & 9 deletions
@@ -185,20 +185,20 @@ to find the optimal model configurations.
 
 ## Managing Memory Usage in Ensembles with Decoupled Models
 
-An *inflight response* is an intermediate output generated by an upstream model and held in memory until it is consumed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight responses accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).
+An *inflight request* is an intermediate request generated by an upstream model and held in memory until it is processed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight requests accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).
 
 Consider an example ensemble model with two steps:
 1. **DALI preprocessor** (decoupled): Produces 100 preprocessed images/sec
 2. **ONNX inference model**: Consumes 10 images/sec
 
 Here, the DALI model produces responses 10× faster than the ONNX model can process them. Without backpressure, these intermediate tensors accumulate in memory, eventually leading to out-of-memory errors.
 
-The `max_inflight_responses` field in the ensemble configuration limits the number of concurrent inflight responses between ensemble steps per request.
+The `max_inflight_requests` field in the ensemble configuration limits the number of concurrent inflight requests between ensemble steps per request.
 When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth.
 
 ```
 ensemble_scheduling {
-  max_inflight_responses: 16
+  max_inflight_requests: 16
 
   step [
     {
@@ -218,30 +218,30 @@ ensemble_scheduling {
 ```
 
 **Configuration:**
-* **`max_inflight_responses: 16`**: For each ensemble request (not globally), at most 16 responses from `dali_preprocess`
+* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
 can wait for `onnx_inference` to process. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.
-* **Default (`0`)**: No limit - allows unlimited inflight responses (original behavior).
+* **Default (`0`)**: No limit - allows unlimited inflight requests (original behavior).
 
 ### When to Use This Feature
 
-Use `max_inflight_responses` when your ensemble includes:
+Use `max_inflight_requests` when your ensemble includes:
 * **Decoupled models** that produce multiple responses per request
 * **Speed mismatch**: Upstream models significantly faster than downstream models
 * **Memory constraints**: Limited GPU/CPU memory available
 
 ### Choosing the Right Value
 
-The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.:
+The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.
 
 * **Too low** (e.g., 1-2): The producer step is frequently blocked, underutilizing faster models
 * **Too high** (e.g., 1000+): Memory usage increases, reducing the effectiveness of backpressure
 * **Recommended**: Start with a small value and tune based on memory usage and throughput monitoring
 
 ### Performance Considerations
 
-* **Zero overhead when disabled**: If `max_inflight_responses: 0` (default),
+* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default),
 no synchronization overhead is incurred.
-* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight response limit is reached and resumed ("woken up") as downstream models consume responses. This synchronization ensures memory usage stays within bounds, though it may increase latency.
+* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight requests limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization ensures memory usage stays within bounds, though it may increase latency.
 
 ## Additional Resources

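Note on the `ensemble_models.md` change: the blocking/wakeup behavior it describes can be pictured with a small, self-contained Python sketch. This is not Triton's implementation. `run_pipeline`, its parameters, and the semaphore-based design are hypothetical stand-ins chosen only to illustrate how a per-step limit keeps a fast producer from running arbitrarily far ahead of a slow consumer, with `0` treated as "no limit" as in the documented default.

```python
import queue
import threading
import time


def run_pipeline(num_items, max_inflight, consume_delay=0.001):
    """Simulate a fast producer feeding a slow consumer.

    max_inflight == 0 means "no limit", mirroring the documented default.
    Returns the peak number of items that were inflight at the same time.
    """
    # Hypothetical stand-in for the per-step limit (an assumption, not Triton code).
    slots = threading.Semaphore(max_inflight) if max_inflight > 0 else None
    pending = queue.Queue()
    lock = threading.Lock()
    state = {"inflight": 0, "peak": 0}

    def producer():
        for i in range(num_items):
            if slots is not None:
                slots.acquire()  # blocks while the limit is reached
            with lock:
                state["inflight"] += 1
                state["peak"] = max(state["peak"], state["inflight"])
            pending.put(i)
        pending.put(None)  # sentinel: nothing more to produce

    def consumer():
        while True:
            item = pending.get()
            if item is None:
                break
            time.sleep(consume_delay)  # the slow downstream step
            with lock:
                state["inflight"] -= 1
            if slots is not None:
                slots.release()  # wakes the producer if it was blocked

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

Calling `run_pipeline(32, 4)` never reports a peak above 4, because an item is counted as inflight only after a slot is acquired and is uncounted before the slot is released. With `max_inflight=0` nothing bounds the peak, so it can grow toward `num_items` when the consumer is slow.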
qa/L0_simple_ensemble/ensemble_backpressure_test.py

Lines changed: 13 additions & 6 deletions
@@ -44,8 +44,9 @@
 SERVER_URL = "localhost:8001"
 DEFAULT_RESPONSE_TIMEOUT = 60
 EXPECTED_INFER_OUTPUT = 0.5
-MODEL_ENSEMBLE_ENABLED = "ensemble_enabled_max_inflight_responses"
-MODEL_ENSEMBLE_DISABLED = "ensemble_disabled_max_inflight_responses"
+MODEL_ENSEMBLE_ENABLED = "ensemble_max_inflight_requests_limit_4"
+MODEL_ENSEMBLE_DISABLED = "ensemble_disabled_max_inflight_requests"
+MODEL_ENSEMBLE_LIMIT_ONE = "ensemble_max_inflight_requests_limit_1"
 
 
 class UserData:
@@ -62,7 +63,7 @@ def callback(user_data, result, error):
 
 class EnsembleBackpressureTest(tu.TestResultCollector):
     """
-    Tests for ensemble backpressure feature (max_inflight_responses).
+    Tests for ensemble backpressure feature (max_inflight_requests).
     """
 
     def _prepare_infer_args(self, input_value):
@@ -139,14 +140,20 @@ def _run_inference(self, model_name, expected_count):
 
     def test_backpressure_limits_inflight(self):
         """
-        Test that max_inflight_responses correctly limits concurrent
+        Test that max_inflight_requests correctly limits concurrent
         responses.
         """
         self._run_inference(model_name=MODEL_ENSEMBLE_ENABLED, expected_count=32)
+
+    def test_backpressure_limit_one(self):
+        """
+        Test edge case: max_inflight_requests=1.
+        """
+        self._run_inference(model_name=MODEL_ENSEMBLE_LIMIT_ONE, expected_count=16)
 
     def test_backpressure_disabled(self):
         """
-        Test that an ensemble model without max_inflight_responses parameter works correctly.
+        Test that an ensemble model without max_inflight_requests parameter works correctly.
         """
         self._run_inference(model_name=MODEL_ENSEMBLE_DISABLED, expected_count=32)
 
@@ -202,7 +209,7 @@ def test_backpressure_request_cancellation(self):
         the client receives a cancellation error.
         """
         # Use a large count to ensure the producer gets blocked by backpressure.
-        # The model is configured with max_inflight_responses = 4.
+        # The model is configured with max_inflight_requests = 4.
         input_value = 32
         user_data = UserData()

qa/L0_simple_ensemble/test.sh

Lines changed: 30 additions & 21 deletions
@@ -147,22 +147,31 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
-######## Test ensemble backpressure feature (max_inflight_responses parameter)
+######## Test ensemble backpressure feature (max_inflight_requests parameter)
 MODEL_DIR="`pwd`/backpressure_test_models"
-mkdir -p ${MODEL_DIR}/ensemble_disabled_max_inflight_responses/1
+mkdir -p ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/1
 
 rm -rf ${MODEL_DIR}/slow_consumer
 mkdir -p ${MODEL_DIR}/slow_consumer/1
 cp ../python_models/ground_truth/model.py ${MODEL_DIR}/slow_consumer/1
 cp ../python_models/ground_truth/config.pbtxt ${MODEL_DIR}/slow_consumer/
 sed -i 's/name: "ground_truth"/name: "slow_consumer"/g' ${MODEL_DIR}/slow_consumer/config.pbtxt
 
-# Create ensemble with "max_inflight_responses = 4"
-rm -rf ${MODEL_DIR}/ensemble_enabled_max_inflight_responses
-mkdir -p ${MODEL_DIR}/ensemble_enabled_max_inflight_responses/1
-cp ${MODEL_DIR}/ensemble_disabled_max_inflight_responses/config.pbtxt ${MODEL_DIR}/ensemble_enabled_max_inflight_responses/
-sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_responses: 4/g' \
-    ${MODEL_DIR}/ensemble_enabled_max_inflight_responses/config.pbtxt
+# Create ensemble with "max_inflight_requests = 4"
+rm -rf ${MODEL_DIR}/ensemble_max_inflight_requests_limit_4
+mkdir -p ${MODEL_DIR}/ensemble_max_inflight_requests_limit_4/1
+cp ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/config.pbtxt ${MODEL_DIR}/ensemble_max_inflight_requests_limit_4/
+sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_requests: 4/g' \
+    ${MODEL_DIR}/ensemble_max_inflight_requests_limit_4/config.pbtxt
+
+# Create ensemble with "max_inflight_requests = 1"
+rm -rf ${MODEL_DIR}/ensemble_max_inflight_requests_limit_1
+mkdir -p ${MODEL_DIR}/ensemble_max_inflight_requests_limit_1/1
+cp ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/config.pbtxt ${MODEL_DIR}/ensemble_max_inflight_requests_limit_1/
+sed -i 's/platform: "ensemble"/name: "ensemble_max_inflight_requests_limit_1"\nplatform: "ensemble"/g' \
+    ${MODEL_DIR}/ensemble_max_inflight_requests_limit_1/config.pbtxt
+sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_requests: 1/g' \
+    ${MODEL_DIR}/ensemble_max_inflight_requests_limit_1/config.pbtxt
 
 BACKPRESSURE_TEST_PY=./ensemble_backpressure_test.py
 SERVER_LOG="./ensemble_backpressure_test_server.log"
@@ -182,7 +191,7 @@ python $BACKPRESSURE_TEST_PY -v >> $CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
     RET=1
 else
-    check_test_results $TEST_RESULT_FILE 4
+    check_test_results $TEST_RESULT_FILE 5
     if [ $? -ne 0 ]; then
         cat $CLIENT_LOG
         echo -e "\n***\n*** Test Result Verification Failed\n***"
@@ -196,37 +205,37 @@ wait $SERVER_PID
 
 set +e
 # Verify valid config was loaded successfully
-if ! grep -q "Ensemble model 'ensemble_enabled_max_inflight_responses' configured with max_inflight_responses: 4" $SERVER_LOG; then
+if ! grep -q "Ensemble model 'ensemble_max_inflight_requests_limit_4' configured with max_inflight_requests: 4" $SERVER_LOG; then
     echo -e "\n***\n*** FAILED: Valid model did not load successfully\n***"
     RET=1
 fi
 set -e
 
 
-######## Test invalid value for "max_inflight_responses"
+######## Test invalid value for "max_inflight_requests"
 INVALID_PARAM_MODEL_DIR="`pwd`/invalid_param_test_models"
 SERVER_ARGS="--model-repository=${INVALID_PARAM_MODEL_DIR}"
-SERVER_LOG="./invalid_max_inflight_responses_server.log"
+SERVER_LOG="./invalid_max_inflight_requests_server.log"
 rm -rf $SERVER_LOG ${INVALID_PARAM_MODEL_DIR}
 
 mkdir -p ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_negative_limit/1
 mkdir -p ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_string_limit/1
 mkdir -p ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_large_value_limit/1
 cp -r ${MODEL_DIR}/decoupled_producer ${MODEL_DIR}/slow_consumer ${INVALID_PARAM_MODEL_DIR}/
 
-# max_inflight_responses = -5
-cp ${MODEL_DIR}/ensemble_disabled_max_inflight_responses/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_negative_limit/
-sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_responses: -5/g' \
+# max_inflight_requests = -5
+cp ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_negative_limit/
+sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_requests: -5/g' \
     ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_negative_limit/config.pbtxt
 
-# max_inflight_responses = "invalid_value"
-cp ${MODEL_DIR}/ensemble_disabled_max_inflight_responses/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_string_limit/
-sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_responses: "invalid_value"/g' \
+# max_inflight_requests = "invalid_value"
+cp ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_string_limit/
+sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_requests: "invalid_value"/g' \
     ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_string_limit/config.pbtxt
 
-# max_inflight_responses = 12345678901
-cp ${MODEL_DIR}/ensemble_disabled_max_inflight_responses/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_large_value_limit/
-sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_responses: 12345678901/g' \
+# max_inflight_requests = 12345678901
+cp ${MODEL_DIR}/ensemble_disabled_max_inflight_requests/config.pbtxt ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_large_value_limit/
+sed -i 's/ensemble_scheduling {/ensemble_scheduling {\n max_inflight_requests: 12345678901/g' \
     ${INVALID_PARAM_MODEL_DIR}/ensemble_invalid_large_value_limit/config.pbtxt
 
