Merged
Changes from all commits
Commits
36 commits
240a223
Update
pskiran1 Oct 13, 2025
974aa25
Update
pskiran1 Oct 13, 2025
337e0a7
Update
pskiran1 Oct 13, 2025
05dcb71
Update
pskiran1 Oct 13, 2025
4f379ed
Fix pre-commit
pskiran1 Oct 13, 2025
9ed216f
Fix pre-commit errors
pskiran1 Oct 13, 2025
78698fc
Update
pskiran1 Oct 13, 2025
8665a0d
Update
pskiran1 Oct 13, 2025
0258eda
Update
pskiran1 Oct 13, 2025
81561fd
Remove duplicate code and add request cancellation test
pskiran1 Oct 14, 2025
10dacec
Fix pre-commit
pskiran1 Oct 14, 2025
e2e48a3
Fix pre-commit
pskiran1 Oct 14, 2025
f8f1468
Update
pskiran1 Oct 14, 2025
3d8b848
Update
pskiran1 Oct 14, 2025
4a1a8fe
Improve model preparation
pskiran1 Oct 15, 2025
554e1b9
Update tests
pskiran1 Oct 17, 2025
b2ad735
Add documentation
pskiran1 Oct 17, 2025
977420a
Update copyright
pskiran1 Oct 17, 2025
c7a6abf
Apply suggestion from @yinggeh
pskiran1 Oct 23, 2025
673ec6a
Update docs/user_guide/decoupled_models.md
pskiran1 Oct 23, 2025
ce95e2f
Update docs/user_guide/ensemble_models.md
pskiran1 Oct 23, 2025
6612bb3
Update tests and docs
pskiran1 Oct 24, 2025
81be2ff
Update
pskiran1 Oct 24, 2025
d28c7bb
Update qa/L0_simple_ensemble/ensemble_backpressure_test.py
pskiran1 Oct 30, 2025
df6de1d
Update qa/L0_simple_ensemble/ensemble_backpressure_test.py
pskiran1 Oct 30, 2025
8645e4f
Update
pskiran1 Oct 30, 2025
beb8484
Update
pskiran1 Oct 30, 2025
801ab01
Update qa/L0_simple_ensemble/ensemble_backpressure_test.py
pskiran1 Oct 31, 2025
2ff1d0b
Fix typo
pskiran1 Oct 31, 2025
e5ed718
Update documentation
pskiran1 Oct 31, 2025
6d76e59
Update docs/user_guide/decoupled_models.md
pskiran1 Oct 31, 2025
c686cea
Update docs/user_guide/ensemble_models.md
pskiran1 Oct 31, 2025
70a457f
Update docs/user_guide/ensemble_models.md
pskiran1 Oct 31, 2025
3a922ad
Update
pskiran1 Oct 31, 2025
8cc7a94
Fix pre-commit
pskiran1 Oct 31, 2025
5be6ce4
Fix test case errors
pskiran1 Nov 1, 2025
12 changes: 10 additions & 2 deletions docs/user_guide/decoupled_models.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -95,7 +95,15 @@ your application should be cognizant that the callback function you registered with
`TRITONSERVER_InferenceRequestSetResponseCallback` can be invoked any number of times,
each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc)

### Knowing When a Decoupled Inference Request is Complete
### Using Decoupled Models in Ensembles

When using decoupled models within an [ensemble pipeline](ensemble_models.md), you may encounter unbounded memory growth if the decoupled model produces responses faster than downstream models can consume them.

To prevent unbounded memory growth in this scenario, consider using the `max_inflight_requests` configuration field. This field limits the maximum number of concurrent inflight requests permitted at each ensemble step for each inference request.

For more details and examples, see [Managing Memory Usage in Ensemble Models](ensemble_models.md#managing-memory-usage-in-ensemble-models).
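
A minimal sketch of where this field is set in the ensemble model configuration (the value `16` and the elided steps are illustrative only; see the linked guide for a complete example):

```
ensemble_scheduling {
  max_inflight_requests: 16
  step [
    # ... ensemble steps ...
  ]
}
```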

## Knowing When a Decoupled Inference Request is Complete

An inference request is considered complete when a response containing the
`TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag is received from a model/backend.
63 changes: 62 additions & 1 deletion docs/user_guide/ensemble_models.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -183,6 +183,67 @@ performance, you can use
[Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
to find the optimal model configurations.

## Managing Memory Usage in Ensemble Models

An *inflight request* refers to an intermediate request generated by an upstream model that is queued and held in memory until it is processed by a downstream model within an ensemble pipeline. When upstream models process requests significantly faster than downstream models, these inflight requests can accumulate and potentially lead to unbounded memory growth. This problem occurs whenever there is a speed mismatch between different steps in the pipeline, and it is particularly common with *decoupled models*, which produce multiple responses per request faster than downstream models can consume them.

Consider an example ensemble model with two steps where the upstream model is 10× faster:
1. **Preprocessing model**: Produces 100 preprocessed requests/sec
2. **Inference model**: Consumes 10 requests/sec

Without backpressure, requests accumulate in the pipeline faster than they can be processed, eventually leading to out-of-memory errors.

The `max_inflight_requests` field in the ensemble configuration sets a limit on the number of concurrent inflight requests permitted at each ensemble step for a single inference request.
When this limit is reached, faster upstream models are paused (blocked) until downstream models finish processing, effectively preventing unbounded memory growth.

```
ensemble_scheduling {
max_inflight_requests: 16

step [
{
model_name: "dali_preprocess"
model_version: -1
input_map { key: "RAW_IMAGE", value: "IMAGE" }
output_map { key: "PREPROCESSED_IMAGE", value: "preprocessed" }
},
{
model_name: "onnx_inference"
model_version: -1
input_map { key: "INPUT", value: "preprocessed" }
output_map { key: "OUTPUT", value: "RESULT" }
}
]
}
```

**Configuration:**
* **`max_inflight_requests: 16`**: For each ensemble request (not globally), at most 16 requests from `dali_preprocess`
can be waiting for `onnx_inference` to process them. Once this per-step limit is reached, `dali_preprocess` is blocked until the downstream step completes a response.
* **Default (`0`)**: No limit - allows unlimited inflight requests (original behavior).

### When to Use This Feature

Use `max_inflight_requests` when your ensemble pipeline includes:
* **Streaming or decoupled models**: When models produce multiple responses per request more quickly than downstream steps can process them.
* **Memory constraints**: Risk of unbounded memory growth from accumulating requests.

### Choosing the Right Value

The optimal value depends on your specific deployment, including batch size, request rate, available memory, and throughput.

* **Too low**: The producer step is blocked too often, which underutilizes faster models.
* **Too high**: Memory usage increases, diminishing the effectiveness of backpressure.
* **Recommendation**: Start with a small value and adjust it based on memory usage and throughput monitoring.

### Performance Considerations

* **Zero overhead when disabled**: If `max_inflight_requests: 0` (default),
no synchronization overhead is incurred.
* **Minimal overhead when enabled**: Uses a blocking/wakeup mechanism per ensemble step, where upstream models are paused ("blocked") when the inflight request limit is reached and resumed ("woken up") as downstream models complete processing them. This synchronization keeps memory usage within bounds, though it may increase latency.

**Note**: This blocking does not cancel or internally time out intermediate requests, but clients may experience increased end-to-end latency.
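
As a concrete illustration of the client-side effect, below is a minimal streaming-client sketch, not part of this change. The model name `ensemble`, the tensor names `IN`/`OUT`, the response count, and the local server address are assumptions based on the QA test models added in this PR, not values defined by this documentation:

```
# Hypothetical streaming client for an ensemble that contains a decoupled producer.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()


def callback(result, error):
    # Each decoupled response (or error) from the ensemble arrives here.
    results.put(error if error is not None else result)


client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=callback)

inputs = [grpcclient.InferInput("IN", [1], "INT32")]
inputs[0].set_data_from_numpy(np.array([8], dtype=np.int32))  # ask for 8 responses

client.async_stream_infer(model_name="ensemble", inputs=inputs)

# Drain the 8 data responses. With max_inflight_requests set, the producer is
# throttled server-side, which can show up as higher end-to-end latency here.
# (A final flag-only response may also arrive, depending on server settings.)
for _ in range(8):
    item = results.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("OUT"))

client.stop_stream()
client.close()
```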

## Additional Resources

You can find additional end-to-end ensemble examples in the links below:
@@ -0,0 +1,54 @@
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
"""
Decoupled model that produces N responses based on input value.
"""

def execute(self, requests):
for request in requests:
# Get input - number of responses to produce
in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
count = in_tensor.as_numpy()[0]

response_sender = request.get_response_sender()

# Produce 'count' responses, each with 0.5 as the output value
for i in range(count):
out_tensor = pb_utils.Tensor("OUT", np.array([0.5], dtype=np.float32))
response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
response_sender.send(response)

# Send final flag
response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

return None
@@ -0,0 +1,58 @@
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


name: "decoupled_producer"
backend: "python"
max_batch_size: 0

input [
{
name: "IN"
data_type: TYPE_INT32
dims: [ 1 ]
}
]

output [
{
name: "OUT"
data_type: TYPE_FP32
dims: [ 1 ]
}
]

instance_group [
{
count: 1
kind: KIND_CPU
}
]

model_transaction_policy {
decoupled: true
}

@@ -0,0 +1,75 @@
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


platform: "ensemble"
max_batch_size: 0

input [
{
name: "IN"
data_type: TYPE_INT32
dims: [ 1 ]
}
]

output [
{
name: "OUT"
data_type: TYPE_FP32
dims: [ 1 ]
}
]

ensemble_scheduling {
step [
{
model_name: "decoupled_producer"
model_version: -1
input_map {
key: "IN"
value: "IN"
}
output_map {
key: "OUT"
value: "intermediate"
}
},
{
model_name: "slow_consumer"
model_version: -1
input_map {
key: "INPUT0"
value: "intermediate"
}
output_map {
key: "OUTPUT0"
value: "OUT"
}
}
]
}
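
The `slow_consumer` model referenced in the second step above is not shown in this section of the diff. Below is a minimal sketch of what such a Python backend model might look like; the 100 ms delay and the pass-through behavior are assumptions for illustration, not the actual test implementation:

```
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical slow consumer: copies INPUT0 to OUTPUT0 after an artificial delay."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Deliberate delay so the upstream decoupled producer outpaces this step.
            time.sleep(0.1)
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor(
                "OUTPUT0", in_tensor.as_numpy().astype(np.float32)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```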
