**docs/user_guide/decoupled_models.md** (1 addition, 1 deletion)

````diff
@@ -97,7 +97,7 @@ each time with a new response. You can take a look at [grpc_server.cc](https://g
 ### Using Decoupled Models in Ensembles

-When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them. To address this, Triton provides the `max_inflight_responses` configuration field, which limits the number of concurrent inflight responses at each ensemble step.
+When using decoupled models within an [ensemble](ensemble_models.md), you may encounter unbounded memory growth if a decoupled model produces responses faster than downstream models can consume them, because these responses are buffered until the downstream models are ready to process them. To address this, Triton provides the `max_inflight_requests` configuration field, which limits the number of concurrent inflight requests at each ensemble step.

 For more details and examples, see [Managing Memory Usage in Ensembles with Decoupled Models](ensemble_models.md#managing-memory-usage-in-ensembles-with-decoupled-models).
````
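The buffering behavior described above is essentially producer/consumer backpressure through a bounded buffer. As a rough analogy (plain Python with the standard library, not Triton code; `LIMIT` stands in for the configured per-step limit), a bounded queue blocks a fast producer once the buffer is full:

```python
# Conceptual sketch only: the effect of the per-step limit is analogous to a
# bounded queue sitting between a fast upstream model and a slow downstream one.
import queue
import threading

LIMIT = 4  # stands in for the configured inflight limit
buffer = queue.Queue(maxsize=LIMIT)  # put() blocks once LIMIT items are pending

consumed = []

def fast_producer(n):
    for i in range(n):
        buffer.put(i)    # backpressure: blocks when the buffer holds LIMIT items
    buffer.put(None)     # sentinel: no more responses

def slow_consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        consumed.append(item)

t = threading.Thread(target=slow_consumer)
t.start()
fast_producer(10)
t.join()
```

With an unbounded queue (`maxsize=0`), the fast producer would run ahead and buffer every item at once, which is exactly the memory-growth failure mode the limit prevents.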
**docs/user_guide/ensemble_models.md** (9 additions, 9 deletions)

````diff
@@ -185,20 +185,20 @@ to find the optimal model configurations.

 ## Managing Memory Usage in Ensembles with Decoupled Models

-An *inflight response* is an intermediate output generated by an upstream model and held in memory until it is consumed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight responses accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).
+An *inflight request* is an intermediate request generated by an upstream model and held in memory until it is processed by a downstream model within an ensemble pipeline. When an ensemble pipeline contains [decoupled models](decoupled_models.md) that produce responses faster than downstream models can process them, inflight requests accumulate internally and may cause unbounded memory growth. This commonly occurs in data preprocessing pipelines where a fast decoupled model (such as DALI, which efficiently streams and preprocesses data) feeds into a slower inference model (such as ONNX Runtime or TensorRT, which are compute-intensive and operate at a lower throughput).

 Consider an example ensemble model with two steps:
 Here, the DALI model produces responses 10× faster than the ONNX model can process them. Without backpressure, these intermediate tensors accumulate in memory, eventually leading to out-of-memory errors.

-The `max_inflight_responses` field in the ensemble configuration limits the number of concurrent inflight responses between ensemble steps per request.
+The `max_inflight_requests` field in the ensemble configuration limits the number of concurrent inflight requests between ensemble steps per request.
 When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth.

 ```
 ensemble_scheduling {
-  max_inflight_responses: 16
+  max_inflight_requests: 16
   step [
     {
@@ -218,30 +218,30 @@ ensemble_scheduling {
 ```

 **Configuration:**
-* **`max_inflight_responses: 16`**: For each ensemble request (not globally), at most 16 responses from `dali_preprocess`
+* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
   can wait for `onnx_inference` to process. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.

-Use `max_inflight_responses` when your ensemble includes:
+Use `max_inflight_requests` when your ensemble includes:

 * **Decoupled models** that produce multiple responses per request
 * **Speed mismatch**: Upstream models significantly faster than downstream models
 * **Memory constraints**: Limited GPU/CPU memory available

 ### Choosing the Right Value

-The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.:
+The optimal value depends on your deployment configuration, including batch size, request rate, available memory, and throughput characteristics.

 * **Too low** (e.g., 1-2): The producer step is frequently blocked, underutilizing faster models
 * **Too high** (e.g., 1000+): Memory usage increases, reducing the effectiveness of backpressure
 * **Recommended**: Start with a small value and tune based on memory usage and throughput monitoring

 ### Performance Considerations

-* **Zero overhead when disabled**: If `max_inflight_responses: 0` (default),
+* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default),
   no synchronization overhead is incurred.
-* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight response limit is reached and resumed ("woken up") as downstream models consume responses. This synchronization ensures memory usage stays within bounds, though it may increase latency.
+* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight requests limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization ensures memory usage stays within bounds, though it may increase latency.
````
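The blocking/wakeup mechanism described in the last bullet can be sketched with a condition variable. This is an illustrative sketch only, not Triton's actual implementation; `InflightLimiter` and the thread roles are hypothetical names chosen for the example:

```python
# Hypothetical sketch of a per-step inflight counter with a blocking/wakeup
# discipline (not Triton internals). The upstream side blocks at the limit;
# the downstream side wakes it as items finish processing.
import threading
import time

class InflightLimiter:
    def __init__(self, limit):
        self.limit = limit
        self.inflight = 0
        self.cond = threading.Condition()

    def acquire(self):
        # Upstream side: pause ("block") while the inflight limit is reached.
        with self.cond:
            while self.inflight >= self.limit:
                self.cond.wait()
            self.inflight += 1

    def release(self):
        # Downstream side: wake ("notify") a blocked upstream producer.
        with self.cond:
            self.inflight -= 1
            self.cond.notify()

limiter = InflightLimiter(limit=2)
observed = []

def upstream(n):
    for _ in range(n):
        limiter.acquire()                 # blocks once 2 items are inflight
        observed.append(limiter.inflight)

def downstream(n):
    for _ in range(n):
        time.sleep(0.01)                  # slower downstream step
        limiter.release()                 # frees one slot, wakes the producer

p = threading.Thread(target=upstream, args=(6,))
c = threading.Thread(target=downstream, args=(6,))
p.start(); c.start()
p.join(); c.join()
```

The invariant is that the inflight count never exceeds the limit, bounding memory; the cost is one lock acquisition and a possible wait per item, which is the "minimal overhead" the bullet refers to.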