
Add realtime AI decoder / predecoder infrastructure (GPU + Host) w/ host dispatcher #457

Draft
wsttiger wants to merge 36 commits into NVIDIA:main from
wsttiger:add_realtime_ai_predecoder_host_side_gb200

Conversation


@wsttiger wsttiger commented Mar 4, 2026

Host-side AI predecoder pipeline with RealtimePipeline abstraction

Summary

Adds a complete host-side realtime decoding pipeline that pairs a GPU-based AI predecoder (TensorRT) with CPU-based PyMatching MWPM decoding, orchestrated by a new RealtimePipeline C++ abstraction that hides all low-level atomics and thread management from application code.

  • AI predecoder service (AIPreDecoderService / AIDecoderService): TensorRT inference wrapped in CUDA graphs with gateway kernels for ring buffer I/O, supporting ONNX model loading, dynamic batch dims, FP16, and engine caching
  • Host-side dispatcher: Spin-polling dispatcher that replaces the device-side persistent kernel, with dynamic worker pool, per-worker CUDA graph launch, idle bitmask scheduling, pre/post-launch callbacks (DMA input copy), and RPC-based function dispatch
  • RealtimePipeline abstraction: Encapsulates dispatcher thread, worker threads, consumer thread, and ring buffer management behind a clean factory/callback API (GpuStageFactory, CpuStageCallback, CompletionCallback), eliminating direct atomic access from application code
  • RingBufferInjector: Dedicated software injection class for test/replay workloads, with thread-safe compare_exchange_weak slot claiming and backpressure tracking
  • GPU-only mode: Pipeline can operate without CPU worker threads when no post-processing decoder is needed
  • Performance optimizations: Vectorized GPU copy kernels (uint4 loads), DMA-based output copy, out-of-order consumer harvesting, ARM memory ordering fixes (std::atomic acquire loads, __sync_synchronize fences)
  • Comprehensive defect fixes: All critical (C1–C4) and major (M1–M12) defects from code review addressed, including race conditions, premature flag writes, dynamic-shape TRT volume overflow, and vector<bool> UB
  • Test coverage: GTest suite with 21 tests (identity passthrough, correctness, host dispatcher, sustained throughput) plus full benchmark with configurable distance/rounds/injection rate

Key files

Area                  | Files
----------------------|------
Pipeline abstraction  | realtime/include/cudaq/realtime/pipeline.h, realtime/lib/pipeline/realtime_pipeline.cu
Host dispatcher       | realtime/include/.../host_dispatcher.h, realtime/lib/.../host_dispatcher.cu, host_dispatcher_capi.cu
Realtime C API        | realtime/include/.../cudaq_realtime.h, realtime/lib/.../cudaq_realtime_api.cpp
AI decoder services   | libs/qec/include/.../ai_decoder_service.h, ai_predecoder_service.h, libs/qec/lib/realtime/ai_decoder_service.cu, ai_predecoder_service.cu
Benchmark             | libs/qec/lib/realtime/test_realtime_predecoder_w_pymatching.cpp
Unit tests            | libs/qec/unittests/test_realtime_pipeline.cu, realtime/unittests/test_host_dispatcher.cu
Design docs           | docs/host_side_dispatcher_design_gemini.md, docs/hybrid_ai_predecoder_pipeline.md

Test plan

  • test_realtime_pipeline — 21 GTest cases pass (identity passthrough, multi-request correctness, shutdown, slot wraparound, sustained throughput)
  • test_realtime_predecoder_w_pymatching — end-to-end benchmark runs with d7, d13_r13, d13_r104 configs at various injection rates
  • test_host_dispatcher — host dispatcher unit tests pass
  • test_dispatch_kernel — dispatch kernel unit tests pass
  • Verify ARM (Grace) memory ordering correctness under sustained load (no stuck slots, no phantom completions)
  • Verify SKIP_TRT=1 passthrough mode works for CI environments without GPU/TRT

wsttiger and others added 29 commits February 18, 2026 20:51
…atch

Introduce the cudaq-realtime library under realtime/, providing
infrastructure for low-latency GPU-accelerated realtime coprocessing
between FPGA/CPU and GPU systems in the NVQLink architecture.

Key components:
- C-compatible host API (cudaq_realtime.h) with dispatch manager/dispatcher
  lifecycle management (create, configure ring buffers, start/stop)
- Persistent GPU dispatch kernel that polls a ring buffer for incoming
  RPC requests and dispatches to registered handlers via function table
  lookup using FNV-1a hashed function IDs
- Two dispatch modes: DeviceCallMode (direct __device__ function calls)
  and GraphLaunchMode (device-side cudaGraphLaunch with backpressure
  and single-launch guards, requires sm_80+)
- Two kernel synchronization strategies: RegularKernel (__syncthreads)
  and CooperativeKernel (grid-wide cooperative_groups sync)
- Schema-driven type system for RPC argument/result descriptors
- Shared library (libcudaq-realtime.so) for the host API and static
  library (libcudaq-realtime-dispatch.a) for GPU kernel device code
- GTest-based unit tests covering device-call dispatch, host API
  integration, and device-side graph launch

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce AIDecoderService and AIPreDecoderService in the QEC library,
enabling a hybrid realtime pipeline where GPU-side TensorRT inference
(predecoding) hands off results to CPU-side classical decoders like
PyMatching.

Key components:
- AIDecoderService: wraps TensorRT inference in a CUDA graph using a
  gateway kernel pattern (mailbox pointer indirection) to bridge the
  dispatch kernel's dynamic ring buffer addresses to TRT's fixed I/O
  buffers. Supports SKIP_TRT env var for testing without TensorRT.
- AIPreDecoderService: extends AIDecoderService with an N-deep pinned
  memory circular queue for GPU-to-CPU handoff, slot claim/release
  protocol (d_claimed_slot, d_inflight_flag), backpressure signaling
  via d_ready_flags/d_queue_idx, and poll_next_job/release_job API
  with proper acquire/release memory ordering
- ThreadPool utility with optional Linux CPU core pinning for
  low-latency PyMatching worker threads
- End-to-end integration test demonstrating the full hybrid pipeline:
  dispatcher -> 4x AIPreDecoderService GPU inference -> polling thread
  -> 4-worker PyMatching thread pool -> TX flag acknowledgment
- CMake integration to find TensorRT and build the test with CUDA
  separable compilation

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Upgrade the AI predecoder test from a dummy identity TRT engine to a
real d=7 r=7 surface code Z-type ONNX model. The service classes now
support ONNX→TRT engine compilation, multi-output tensor bindings, and
type-agnostic (INT32) I/O. The test fires 8 realistic syndrome payloads
through 4 GPU pre-decoders and verifies end-to-end residual detector
output handed off to simulated PyMatching workers.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Replace the simulated PyMatching worker with a real MWPM decoder
using the d=7 surface code's static Z parity check matrix via the
cudaq-qec decoder plugin system. The 336 residual detectors from
the AI predecoder are sliced into 14 spatial rounds of 24 Z-stabilizer
syndromes and decoded independently. A mutex protects the decoder
for thread safety across the 4-worker thread pool.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Extract hard-coded d=7 parameters into a PipelineConfig struct with
static factory methods for d=7, d=13, d=21, and d=31 surface codes.
Runtime config selection via command-line argument (d7|d13|d21|d31)
preserves existing d=7 functionality while enabling larger-distance
experiments. ONNX_MODEL_PATH replaced with ONNX_MODEL_DIR to support
per-config model filenames.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a reusable header-only latency and throughput tracker for
realtime decoding pipelines. Provides per-request submit/complete
timestamping, percentile statistics (p50/p90/p95/p99), and a formatted
report including wall time, throughput, and per-request breakdown.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
… requests

Enhance PipelineBenchmark to distinguish submitted vs completed requests,
report timeouts, and cap per-request output to 50 entries. Integrate it
into the predecoder pipeline test with per-request submit/complete
markers and spin-wait polling for accurate latency measurement. Increase
default total_requests from 20 to 100 across all distance configs.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Instrument the PyMatching worker with high-resolution timestamps to
measure decode time vs worker overhead. Report a breakdown showing
PyMatching decode, worker overhead, and GPU+dispatch+poll latency as
percentages of the total end-to-end pipeline, plus per-round latency.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Add engine caching: prefer a pre-built .engine file when available,
otherwise build from ONNX and save the engine for subsequent runs.
Replace the single mutex-protected PyMatching decoder with a pool of
per-worker decoder instances using thread-local index assignment,
eliminating lock contention in the decode path.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a streaming test mode alongside the existing batch mode,
activated via CLI (`stream [rate_us] [duration_s]`). The streaming mode
uses dedicated producer/consumer threads to simulate continuous FPGA
syndrome arrival with configurable inter-arrival rate, in-flight
throttling (capped to num_predecoders), backpressure tracking, and
warmup period exclusion from latency stats. Reports steady-state
throughput, percentile latencies, and per-round timing breakdown.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…nt kernel

The CUDA device runtime has a hardcoded 128 fire-and-forget graph launch
slot limit that is never reclaimed while a persistent parent kernel runs,
making the device-side dispatcher unsuitable for sustained operation.

This adds a host-side CPU dispatcher thread that polls rx_flags and calls
cudaGraphLaunch from host code on per-predecoder CUDA streams, bypassing
the device runtime limit entirely. Streaming mode uses the host dispatcher;
batch mode retains the device-side dispatcher for backward compatibility.

Key changes:
- New host_dispatcher.h/.cpp with host_dispatcher_loop()
- AIPreDecoderService::capture_graph() gains device_launch flag for
  conditional cudaGraphInstantiateFlagDeviceLaunch vs standard instantiation
- d_queue_idx_ changed from cudaMalloc to cudaHostAllocMapped so the host
  dispatcher can read backpressure state without cudaMemcpy
- Mailbox bank changed to mapped pinned memory for zero-copy host writes
- Streaming test uses host dispatcher with per-predecoder streams

Verified: d7 streaming 16,824 requests (219 us mean, 31 us/round),
d13 streaming 6,227 requests (455 us mean, 35 us/round), zero errors.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…actor

- Add host dispatcher with dynamic worker pool (idle_mask, inflight_slot_tags)
  to avoid head-of-line blocking; use libcu++ system-scope atomics for
  rx/tx/ready flags and mapped pinned memory.
- Extend AIPreDecoderService and PreDecoderJob with origin_slot for
  out-of-order completion; default queue_depth 1 for host dispatch.
- Add design doc (host_side_dispatcher_design_gemini.md) with
  spin-polling dispatcher and worker pseudocode/constraints.
- Refactor test_realtime_predecoder_w_pymatching for dynamic pool and
  update CMakeLists; adjust nvqlink daemon and dispatch_kernel for
  host-side dispatch.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
This commit fundamentally redesigns the host-side execution model to achieve
microsecond-level latency, shifting from a general-purpose thread pool to
a strict, pinned, and lock-free architecture.

Key architectural changes in `test_realtime_predecoder_w_pymatching.cpp`:

1. Dedicated Polling Threads (Removed Thread Pool)
   - Replaced `cudaq::qec::utils::ThreadPool` and the single `incoming_thread`
     with a vector of dedicated `std::thread` worker loops.
   - Eliminates queueing latency, mutex locking, and context switching
     overhead. Each worker thread now spins continuously checking for its own
     GPU completions.

2. Strict CPU Thread Pinning
   - Introduced `pin_thread_to_core` and `pin_current_thread_to_core` using
     the Linux `pthread_setaffinity_np` API.
   - Pinned the Dispatcher (Core 2), Producer (Core 3), Consumer (Core 4),
     and all Worker threads (Cores 10+) to ensure they never migrate, keeping
     their CPU caches perfectly warm.

3. High-Resolution Sub-Component Timing
   - Added tracking arrays (`dispatch_ts`, `poll_ts`, `debug_dispatch_ts_arr`)
     piped through `WorkerPoolContext` and `PreDecoderJob`.
   - Updated end-of-run reporting to calculate differences between timestamps,
     proving that Host Dispatch overhead is negligible (~1-3µs) and the
     bottleneck is the GPU inference itself.

4. PyMatching Data Conversion Optimization
   - Inside `pymatching_worker_task`, replaced the conversion of `int32_t`
     syndrome data into a `std::vector<double>`.
   - Now populates a pre-allocated `cudaqx::tensor<uint8_t>` to avoid slow
     double-precision conversions inside the latency-critical worker loop.

5. NVTX Profiling Markers
   - Included `<nvtx3/nvToolsExt.h>` and wrapped key blocks in
     `nvtxRangePushA` and `nvtxRangePop`.
   - Enables generation of `nsys` profiles to visually align CPU thread
     activity with GPU TensorRT execution.

Other changes:
- Enable TensorRT FP16 builder flag (`kFP16`) in `ai_decoder_service.cu`
  for supported platforms to accelerate GPU inference.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…e directory. Refactored the pymatching demo code to use the updated functions in realtime.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…nstrumentation

Remove the duplicate QEC-local host_dispatcher.{h,cpp} in favor of
the canonical realtime library versions, eliminating link ambiguity.

Fix three correctness/performance bugs in the streaming pipeline:
- Consumer was strict in-order, causing 327 µs head-of-line blocking
  when parallel workers complete out of order. Changed to scan all
  active slots and harvest whichever are ready.
- Dispatcher set tx_flags=READY immediately on graph launch (when
  tx_data_host was non-null), causing phantom completions. Set
  tx_data_host/dev to nullptr so dispatcher uses IN_FLIGHT sentinel.
- Race between consumer clearing tx_flags and resetting slot_request:
  producer could see slot available and write slot_request before the
  consumer's slot_request=-1, permanently orphaning the slot. Fixed
  by resetting slot_request before clearing tx_flags with a store
  fence (__sync_synchronize) for ARM memory ordering.

Replace broken timing breakdown (dispatch_ts was always 0, making the
entire report show "Other/Misc Wait") with a 3-stage per-request
breakdown: [A] submit→worker poll, [B] worker task, [C] consumer
poll lag, with p50/p99 percentiles.

Also: reduce NUM_SLOTS 64→16 to cut queuing delay, remove unused
queue_depth from PipelineConfig, add DISABLE_PYMATCHING conditional
compilation, add stuck-request diagnostics, and remove batch mode /
watchdog / dead code.

Results (d7, 8 workers, open-loop):
  62.5K req/s, 230 µs mean latency, 500K/500K completed, 0 drops.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Increase num_predecoders and num_workers from 8 to 16 across all
config presets, and set NUM_SLOTS to 32. With 8 workers the pipeline
capacity (~24K req/s) was below the 33K req/s arrival rate at 30 µs
spacing, causing unbounded queuing and p99 latency spikes to 4.9 ms.

With 16 workers and 32 slots, d13 at 30 µs arrival sustains 25K req/s
with 299 µs mean latency (23 µs/round), p99 = 334 µs, and near-zero
backpressure (9K stalls vs 38M previously).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add optimization profile in build_engine_from_onnx for ONNX models
with dynamic dimensions (batch dim = 0). When detected, pin all
dynamic dims to 1 via min/opt/max profile so TensorRT can build
the engine. Previously these models failed with "Failed to build
TRT engine from ONNX".

Switch d13 config to predecoder_memory_d13_T13_X.onnx, which takes
detectors as input rather than raw measurements. End-to-end latency
drops from 299 µs to 226 µs, mainly from PyMatching (69 µs → 12 µs).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Replace byte-by-byte memory copies with wider load/store operations
to reduce memory transactions in the CUDA graph. The input kernel now
uses uint32_t (4-byte) copies, the passthrough kernel uses uint4
(16-byte) copies, and the output kernel is replaced entirely with
cudaMemcpyAsync (DMA copy engine) followed by a minimal 1-thread
signal kernel. Thread counts bumped from 128 to 256.

Reduces d13 mean end-to-end latency from 226 µs to 141 µs (~85 µs)
and per-round latency from 17.4 µs to 10.8 µs.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move the input copy from an SM-based kernel inside the CUDA graph to a
host-issued cudaMemcpyAsync via a new pre_launch_fn callback on
HostDispatchWorker. This frees GPU compute units for TRT inference and
reduces Stage A latency by ~19 µs. Add get_trt_input_ptr() and
get_host_ring_ptrs() accessors to support the callback wiring.

Separate the T104 ONNX model into its own d13_r104 pipeline config
(104 rounds, 32K slots) and restore d13_r13 to use the T13 model.

Update design document to reflect DMA data movement, pre-launch
callback, out-of-order consumer, and ARM memory ordering constraints.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Introduce a RealtimePipeline class (pipeline.h, realtime_pipeline.cu)
that encapsulates all ring buffer allocation, atomic synchronization,
dispatcher wiring, worker thread management, and consumer slot lifecycle
behind a callback-driven API. Application code provides a GPU stage
factory, a CPU stage callback, and a completion handler -- zero direct
atomic access required.

Refactor test_realtime_predecoder_w_pymatching.cpp from 1083 lines to
~470 lines by replacing inline atomics, thread management, and slot
tracking with pipeline.submit() / pipeline.stop() / pipeline.stats().
Add d13_r104 config (T=104 model, 131K slot size).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
21 tests covering AIDecoderService, AIPreDecoderService, and the
host-side dispatcher. Correctness tests push 5,000 random 6.4 KB
payloads through the full CUDA graph pipeline and verify bitwise
identity. Integration tests exercise multi-predecoder concurrency
and sustained throughput (200 requests, regression for the 128-launch
limit fix). SKIP_TRT buffer size increased to 1600 floats to match
realistic syndrome payload sizes.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
The legacy predecoder_input_kernel and its cudaq::nvqlink includes are
no longer used, since input data now arrives via the pre_launch DMA
callback; remove them. Design doc updated to reflect the current code:
the kernel removal, the RealtimePipeline scaffolding, the test suite,
and the SKIP_TRT buffer size (1600 floats).

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…provements

Add GPU-only pipeline mode that skips CPU worker threads when no
cpu_stage callback is registered, using cudaLaunchHostFunc for
completion signaling instead. Add post_launch_fn/post_launch_data
callback to HostDispatchWorker and GpuWorkerResources, called after
successful cudaGraphLaunch. Rename CpuStageContext fields to
gpu_output/gpu_output_size and AIPreDecoderService buffers to
h_predecoder_outputs_/d_predecoder_outputs_ for clarity.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Address all critical (C1-C4) and major (M1-M12) defects identified
during code review:

Critical fixes:
- C1: Fix race condition in try_submit via compare_exchange_weak
- C2: Use uint64_t + separate occupancy flag for slot_request to
  support full request_id range (was int64_t with -1 sentinel)
- C3: Add __syncthreads() before response header write in
  gateway_output_kernel to prevent partially-written result reads
- C4: Always write IN_FLIGHT sentinel to tx_flags after graph launch

Major fixes:
- M1: Remove cudaSetDeviceFlags from RingBufferManager (caller's duty)
- M2: Use std::atomic load with memory_order_acquire for tx/rx flag
  reads instead of plain volatile (ARM correctness)
- M3: Validate num_workers <= 64 (idle_mask capacity)
- M4: Validate gpu_factory is set before start()
- M5: Check producer_stop in RingBufferInjector::submit to prevent
  infinite spin after shutdown
- M6: Make started flag std::atomic<bool>
- M7: Add CUDA error checks in AIDecoderService::capture_graph
- M8: Check enqueueV3 return value in both service files
- M9: Fix tensor_volume for dynamic-shape dims (was wrapping to
  SIZE_MAX on dim=-1)
- M10: Assert num_workers == num_predecoders in benchmark
- M11: Add aarch64 paths to predecoder test's TRT CMake search
- M12: Replace vector<bool> with vector<uint8_t> to avoid concurrent
  write UB

Also extracts submit logic into RingBufferInjector class to separate
test infrastructure from pipeline core.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
@wsttiger wsttiger requested a review from bmhowe23 March 4, 2026 02:26
@wsttiger wsttiger requested review from cketcham2333 and kvmto March 4, 2026 02:26

wsttiger commented Mar 4, 2026

The code in the top-level realtime directory is only temporary


bmhowe23 commented Mar 4, 2026

Noting: the realtime stuff should not be reviewed in this PR. It should be reviewed in NVIDIA/cuda-quantum#4041 instead.

wsttiger added 7 commits March 4, 2026 17:39
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…exclusively

The realtime/ source tree is removed from the build. All CMake targets
(cudaq-realtime-pipeline, test_realtime_pipeline, and
test_realtime_predecoder_w_pymatching) now discover headers and libraries
from the CUDAQ_REALTIME_ROOT install prefix via find_path/find_library.

- Remove add_subdirectory(realtime) from top-level CMakeLists.txt
- Move pipeline.h to libs/qec/include/cudaq/qec/realtime/pipeline.h
- Move realtime_pipeline.cu to libs/qec/lib/realtime/
- Rewrite cudaq-realtime-pipeline target to link against installed libs
- Remove all in-tree TARGET cudaq-realtime branches from unittests CMake
- Migrate cudaq::nvqlink:: namespace references to cudaq::realtime::
- Update #include paths from cudaq/nvqlink/ to cudaq/realtime/
- Delete the entire realtime/ source tree (13.5k lines)

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…RPATH

Link cudaq-realtime-host-dispatch directly to the predecoder test to
resolve a missing symbol at runtime (RUNPATH is not transitive). Reorder
BUILD_RPATH so the local build directory is searched before the install
prefix, ensuring the freshly built pipeline library is loaded.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Update all cudaqx-side consumers to use the renamed C-compatible host
dispatcher types (cudaq_host_dispatcher_config_t, cudaq_host_dispatch_worker_t,
cudaq_host_dispatcher_loop) with opaque void* atomic fields and
pointer+count worker arrays. Fix uninitialized post_launch_fn causing
segfault in dispatcher tests.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Adapt RingBufferManager and RingBufferInjector to pass request_id and
ptp_timestamp through to the updated cudaq_host_ringbuffer_write_rpc_request
API (now writes the full 24-byte RPCHeader). Zero-initialize RPCHeader in
test_realtime_pipeline to avoid uninitialized fields.

Tune d13_r104 config to 16 slots / 4 workers based on benchmarking
(0 backpressure stalls, p50=169µs, p99=186µs).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The TRT engine for the predecoder model uses uint8 I/O, but
trt_dtype_size() was missing the kUINT8 case, falling through to the
default of 4 bytes. This caused 4x oversized buffer allocations, garbled
model input (int32 values read as uint8 by TRT), and misinterpreted
output — producing coin-flip LER (~0.50).

Three fixes bring the pipeline to verified-correct LER of 0.002:

1. Add nvinfer1::DataType::kUINT8 to trt_dtype_size() (returns 1).
   Corrects buffer sizes, input copy, and output interpretation.

2. Replace the CUDA-Q surface_code H_z parity matrix with the
   Stim-derived full spacetime check matrix (H) and observables
   matrix (O), loaded from binary files in --data-dir. This gives
   PyMatching the correct matching graph and lets it project edge
   corrections onto the logical observable.

3. Add --data-dir support for loading pre-generated Stim detector
   samples and ground-truth observables. The producer feeds real
   uint8 detector data through the pipeline, and a post-run
   correctness report compares decode results against ground truth.

Additional changes:
- Derive slot_size, residual_detectors, and spatial_slices from the
  TRT model bindings at runtime instead of hardcoding in PipelineConfig
- Read request_id from RPCHeader before overwriting with RPCResponse
- Track per-request decode_corrections and logical_pred via request_id
- Pre-allocate syndrome tensors with thread_local to avoid per-decode
  heap allocation
- Bump d13_r104 config to 8 workers / 32 slots for full-H decode
  latency headroom
- Guard QEC_CPU_RELAX macro against redefinition from host_dispatcher.h
- Print TRT binding dtype and element size in setup diagnostics

Signed-off-by: Scott Thornton <wsttiger@gmail.com>