From 90c5a6871fbd217d73f7090db774a72c10152968 Mon Sep 17 00:00:00 2001
From: Scott Thornton <sthornton@nvidia.com>
Date: Fri, 9 Jan 2026 23:54:54 +0000
Subject: [PATCH] docs(qec): update trt_decoder documentation for recent
 enhancements

Update TRT decoder documentation to reflect features introduced in commits
eff6966, 2954c01, and 49bdde8, bringing it inline with other QEC decoder
documentation.

Key additions:

1. CUDA Graph Optimization (commit 2954c01):
   - Document new `use_cuda_graph` parameter (default: True)
   - Note ~20% performance improvement from CUDA graph optimization
   - Explain automatic fallback for models with dynamic shapes

2. Batch Processing Support (commit eff6966):
   - Document automatic batch size detection
   - Explain zero-padding behavior for single syndrome decode()
   - Clarify decode_batch() requirements for batch-size multiples

3. Real-Time Decoding Integration (commit 49bdde8):
   - Add comprehensive trt_decoder_config documentation
   - Include Python and C++ examples for real-time configuration
   - Document YAML serialization support
   - Add configuration reference in python_realtime_decoding_api.rst

4. Documentation Structure Improvements:
   - Add performance characteristics section
   - Add batch processing notes
   - Include cross-references to real-time decoding examples
   - Maintain consistency with nv-qldpc and sliding_window decoder docs

Files changed:
- docs/sphinx/api/qec/trt_decoder_api.rst: Added parameters, real-time
  config section, and performance notes
- docs/sphinx/api/qec/python_realtime_decoding_api.rst: Added
  trt_decoder_config class documentation
- docs/sphinx/examples_rst/qec/realtime_decoding.rst: Added TRT decoder
  to decoder selection section

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
---
 .../api/qec/python_realtime_decoding_api.rst  | 39 +++++++++++++++++++
 docs/sphinx/api/qec/trt_decoder_api.rst       | 25 ++++++++++++
 .../examples_rst/qec/realtime_decoding.rst    |  6 +++
 3 files changed, 70 insertions(+)

diff --git a/docs/sphinx/api/qec/python_realtime_decoding_api.rst b/docs/sphinx/api/qec/python_realtime_decoding_api.rst
index 99ddd7e6..2bff97ca 100644
--- a/docs/sphinx/api/qec/python_realtime_decoding_api.rst
+++ b/docs/sphinx/api/qec/python_realtime_decoding_api.rst
@@ -72,6 +72,45 @@ Configuration API
 
 The configuration API enables setting up decoders before circuit execution. Decoders are configured using YAML files or programmatically constructed configuration objects.
 
+Configuration Types
+^^^^^^^^^^^^^^^^^^^
+
+.. py:class:: cudaq_qec.trt_decoder_config
+
+   Configuration for TensorRT decoder in real-time decoding system.
+
+   **Attributes:**
+
+   .. py:attribute:: onnx_load_path
+      :type: Optional[str]
+
+      Path to ONNX model file. Mutually exclusive with engine_load_path.
+
+   .. py:attribute:: engine_load_path
+      :type: Optional[str]
+
+      Path to pre-built TensorRT engine file. Mutually exclusive with 
+      onnx_load_path.
+
+   .. py:attribute:: engine_save_path
+      :type: Optional[str]
+
+      Path to save built TensorRT engine for reuse.
+
+   .. py:attribute:: precision
+      :type: Optional[str]
+
+      Inference precision mode: "fp16", "bf16", "int8", "fp8", "tf32", 
+      "noTF32", or "best" (default).
+
+   .. py:attribute:: memory_workspace
+      :type: Optional[int]
+
+      Workspace memory size in bytes (default: 1073741824 = 1GB).
+
+Configuration Functions
+^^^^^^^^^^^^^^^^^^^^^^^^
+
 .. py:function:: cudaq_qec.configure_decoders(config)
 
    Configure decoders from a multi_decoder_config object.
diff --git a/docs/sphinx/api/qec/trt_decoder_api.rst b/docs/sphinx/api/qec/trt_decoder_api.rst
index 590243f9..a0394fb9 100644
--- a/docs/sphinx/api/qec/trt_decoder_api.rst
+++ b/docs/sphinx/api/qec/trt_decoder_api.rst
@@ -10,6 +10,11 @@
     architecture and supports various precision modes (FP16, BF16, INT8, FP8)
     to balance accuracy and speed.
 
+    Neural network-based decoders can be trained to perform syndrome decoding
+    for specific quantum error correction codes and noise models. The TRT decoder
+    provides a high-performance inference engine for these models, with automatic
+    CUDA graph optimization for reduced latency.
+
     Requires a CUDA-capable GPU and TensorRT installation. See the `CUDA-Q GPU
     Compatibility List
     <https://nvidia.github.io/cuda-quantum/latest/using/install/local_installation.html#dependencies-and-compatibility>`_
@@ -80,6 +85,13 @@
       only required to satisfy the decoder interface. You can pass any valid
       parity check matrix of appropriate dimensions.
 
+    .. note::
+      **Batch Processing**: The TRT decoder automatically handles batch size
+      optimization. Models trained with batch_size > 1 will receive
+      zero-padded inputs when using `decode()` on a single syndrome. When
+      using `decode_batch()`, provide syndromes in multiples of the model's
+      batch size for optimal performance.
+
     :param H: Parity check matrix (tensor format). Note: This parameter is not
               used by the TRT decoder but is required by the decoder interface.
     :param params: Heterogeneous map of parameters:
@@ -116,3 +128,16 @@
           engine building (defaults to 1GB = 1073741824 bytes). Larger workspaces
           may allow TensorRT to explore more optimization strategies.
 
+        - `use_cuda_graph` (bool): Enable CUDA graph optimization for improved
+          performance (defaults to True). CUDA graphs capture inference operations
+          and replay them with reduced kernel launch overhead, providing ~20%
+          speedup. The optimization is applied automatically on the first decode
+          call. Automatically disabled for models with dynamic shapes or
+          multiple optimization profiles. Set to False to force traditional
+          execution path.
+
+        - `batch_size` (automatic): The decoder automatically detects the model's
+          batch size from the first input dimension. For models with batch_size > 1,
+          the `decode()` method automatically zero-pads single syndromes to fill
+          the batch. The `decode_batch()` method requires the number of syndromes
+          to be an integral multiple of the model's batch size.
diff --git a/docs/sphinx/examples_rst/qec/realtime_decoding.rst b/docs/sphinx/examples_rst/qec/realtime_decoding.rst
index 0c14d180..b1c02de9 100644
--- a/docs/sphinx/examples_rst/qec/realtime_decoding.rst
+++ b/docs/sphinx/examples_rst/qec/realtime_decoding.rst
@@ -529,6 +529,12 @@ Decoder Selection
 ^^^^^^^^^^^^^^^^^
 The page `CUDA-Q QEC Decoders <https://nvidia.github.io/cudaqx/components/qec/introduction.html#pre-built-qec-decoders>`_ provides information about which decoders are compatible with real-time decoding.
 
+The TRT decoder (``trt_decoder``) can be configured for real-time decoding by specifying 
+``trt_decoder_config`` parameters. This is useful for neural network-based 
+decoders trained for specific codes and noise models. Note that TRT models 
+must be trained with the appropriate input/output dimensions matching the 
+syndrome and error spaces. See :ref:`trt_decoder_api_python` for detailed configuration options.
+
 Troubleshooting
 ---------------