Add realtime AI decoder / predecoder infrastructure (GPU + Host) w/ host dispatcher #457
Draft
wsttiger wants to merge 36 commits into NVIDIA:main from
Conversation
…atch

Introduce the cudaq-realtime library under realtime/, providing infrastructure for low-latency GPU-accelerated realtime coprocessing between FPGA/CPU and GPU systems in the NVQLink architecture.

Key components:
- C-compatible host API (cudaq_realtime.h) with dispatch manager/dispatcher lifecycle management (create, configure ring buffers, start/stop)
- Persistent GPU dispatch kernel that polls a ring buffer for incoming RPC requests and dispatches to registered handlers via function table lookup using FNV-1a hashed function IDs
- Two dispatch modes: DeviceCallMode (direct __device__ function calls) and GraphLaunchMode (device-side cudaGraphLaunch with backpressure and single-launch guards; requires sm_80+)
- Two kernel synchronization strategies: RegularKernel (__syncthreads) and CooperativeKernel (grid-wide cooperative_groups sync)
- Schema-driven type system for RPC argument/result descriptors
- Shared library (libcudaq-realtime.so) for the host API and static library (libcudaq-realtime-dispatch.a) for GPU kernel device code
- GTest-based unit tests covering device-call dispatch, host API integration, and device-side graph launch

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce AIDecoderService and AIPreDecoderService in the QEC library, enabling a hybrid realtime pipeline where GPU-side TensorRT inference (predecoding) hands off results to CPU-side classical decoders like PyMatching.

Key components:
- AIDecoderService: wraps TensorRT inference in a CUDA graph using a gateway kernel pattern (mailbox pointer indirection) to bridge the dispatch kernel's dynamic ring buffer addresses to TRT's fixed I/O buffers. Supports a SKIP_TRT env var for testing without TensorRT.
- AIPreDecoderService: extends AIDecoderService with an N-deep pinned-memory circular queue for GPU-to-CPU handoff, a slot claim/release protocol (d_claimed_slot, d_inflight_flag), backpressure signaling via d_ready_flags/d_queue_idx, and a poll_next_job/release_job API with proper acquire/release memory ordering
- ThreadPool utility with optional Linux CPU core pinning for low-latency PyMatching worker threads
- End-to-end integration test demonstrating the full hybrid pipeline: dispatcher -> 4x AIPreDecoderService GPU inference -> polling thread -> 4-worker PyMatching thread pool -> TX flag acknowledgment
- CMake integration to find TensorRT and build the test with CUDA separable compilation

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Upgrade the AI predecoder test from a dummy identity TRT engine to a real d=7 r=7 surface code Z-type ONNX model. The service classes now support ONNX→TRT engine compilation, multi-output tensor bindings, and type-agnostic (INT32) I/O. The test fires 8 realistic syndrome payloads through 4 GPU pre-decoders and verifies end-to-end residual detector output handed off to simulated PyMatching workers. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Replace the simulated PyMatching worker with a real MWPM decoder using the d=7 surface code's static Z parity check matrix via the cudaq-qec decoder plugin system. The 336 residual detectors from the AI predecoder are sliced into 14 spatial rounds of 24 Z-stabilizer syndromes and decoded independently. A mutex protects the decoder for thread safety across the 4-worker thread pool. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Extract hard-coded d=7 parameters into a PipelineConfig struct with static factory methods for d=7, d=13, d=21, and d=31 surface codes. Runtime config selection via command-line argument (d7|d13|d21|d31) preserves existing d=7 functionality while enabling larger-distance experiments. ONNX_MODEL_PATH replaced with ONNX_MODEL_DIR to support per-config model filenames. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a reusable header-only latency and throughput tracker for realtime decoding pipelines. Provides per-request submit/complete timestamping, percentile statistics (p50/p90/p95/p99), and a formatted report including wall time, throughput, and per-request breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
… requests Enhance PipelineBenchmark to distinguish submitted vs completed requests, report timeouts, and cap per-request output to 50 entries. Integrate it into the predecoder pipeline test with per-request submit/complete markers and spin-wait polling for accurate latency measurement. Increase default total_requests from 20 to 100 across all distance configs. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Instrument the PyMatching worker with high-resolution timestamps to measure decode time vs worker overhead. Report a breakdown showing PyMatching decode, worker overhead, and GPU+dispatch+poll latency as percentages of the total end-to-end pipeline, plus per-round latency. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Add engine caching: prefer a pre-built .engine file when available, otherwise build from ONNX and save the engine for subsequent runs. Replace the single mutex-protected PyMatching decoder with a pool of per-worker decoder instances using thread-local index assignment, eliminating lock contention in the decode path. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a streaming test mode alongside the existing batch mode, activated via CLI (`stream [rate_us] [duration_s]`). The streaming mode uses dedicated producer/consumer threads to simulate continuous FPGA syndrome arrival with configurable inter-arrival rate, in-flight throttling (capped to num_predecoders), backpressure tracking, and warmup period exclusion from latency stats. Reports steady-state throughput, percentile latencies, and per-round timing breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…nt kernel

The CUDA device runtime has a hardcoded 128 fire-and-forget graph launch slot limit that is never reclaimed while a persistent parent kernel runs, making the device-side dispatcher unsuitable for sustained operation. This adds a host-side CPU dispatcher thread that polls rx_flags and calls cudaGraphLaunch from host code on per-predecoder CUDA streams, bypassing the device runtime limit entirely. Streaming mode uses the host dispatcher; batch mode retains the device-side dispatcher for backward compatibility.

Key changes:
- New host_dispatcher.h/.cpp with host_dispatcher_loop()
- AIPreDecoderService::capture_graph() gains a device_launch flag for conditional cudaGraphInstantiateFlagDeviceLaunch vs standard instantiation
- d_queue_idx_ changed from cudaMalloc to cudaHostAllocMapped so the host dispatcher can read backpressure state without cudaMemcpy
- Mailbox bank changed to mapped pinned memory for zero-copy host writes
- Streaming test uses the host dispatcher with per-predecoder streams

Verified: d7 streaming 16,824 requests (219 us mean, 31 us/round), d13 streaming 6,227 requests (455 us mean, 35 us/round), zero errors.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…actor

- Add host dispatcher with a dynamic worker pool (idle_mask, inflight_slot_tags) to avoid head-of-line blocking; use libcu++ system-scope atomics for rx/tx/ready flags and mapped pinned memory.
- Extend AIPreDecoderService and PreDecoderJob with origin_slot for out-of-order completion; default queue_depth to 1 for host dispatch.
- Add design doc (host_side_dispatcher_design_gemini.md) with spin-polling dispatcher and worker pseudocode/constraints.
- Refactor test_realtime_predecoder_w_pymatching for the dynamic pool and update CMakeLists; adjust the nvqlink daemon and dispatch_kernel for host-side dispatch.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
This commit fundamentally redesigns the host-side execution model to achieve
microsecond-level latency, shifting from a general-purpose thread pool to
a strict, pinned, and lock-free architecture.
Key architectural changes in `test_realtime_predecoder_w_pymatching.cpp`:
1. Dedicated Polling Threads (Removed Thread Pool)
- Replaced `cudaq::qec::utils::ThreadPool` and the single `incoming_thread`
with a vector of dedicated `std::thread` worker loops.
- Eliminates queueing latency, mutex locking, and context switching
overhead. Each worker thread now spins continuously checking for its own
GPU completions.
2. Strict CPU Thread Pinning
- Introduced `pin_thread_to_core` and `pin_current_thread_to_core` using
the Linux `pthread_setaffinity_np` API.
- Pinned the Dispatcher (Core 2), Producer (Core 3), Consumer (Core 4),
and all Worker threads (Cores 10+) to ensure they never migrate, keeping
their CPU caches perfectly warm.
3. High-Resolution Sub-Component Timing
- Added tracking arrays (`dispatch_ts`, `poll_ts`, `debug_dispatch_ts_arr`)
piped through `WorkerPoolContext` and `PreDecoderJob`.
- Updated end-of-run reporting to calculate differences between timestamps,
proving that Host Dispatch overhead is negligible (~1-3µs) and the
bottleneck is the GPU inference itself.
4. PyMatching Data Conversion Optimization
- Inside `pymatching_worker_task`, replaced the conversion of `int32_t`
syndrome data into a `std::vector<double>`.
- Now populates a pre-allocated `cudaqx::tensor<uint8_t>` to avoid slow
double-precision conversions inside the latency-critical worker loop.
5. NVTX Profiling Markers
- Included `<nvtx3/nvToolsExt.h>` and wrapped key blocks in
`nvtxRangePushA` and `nvtxRangePop`.
- Enables generation of `nsys` profiles to visually align CPU thread
activity with GPU TensorRT execution.
Other changes:
- Enable TensorRT FP16 builder flag (`kFP16`) in `ai_decoder_service.cu`
for supported platforms to accelerate GPU inference.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…e directory. Refactored the pymatching demo code to use the updated functions in realtime. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…nstrumentation
Remove the duplicate QEC-local host_dispatcher.{h,cpp} in favor of
the canonical realtime library versions, eliminating link ambiguity.
Fix three correctness/performance bugs in the streaming pipeline:
- Consumer was strict in-order, causing 327 µs head-of-line blocking
when parallel workers complete out of order. Changed to scan all
active slots and harvest whichever are ready.
- Dispatcher set tx_flags=READY immediately on graph launch (when
tx_data_host was non-null), causing phantom completions. Set
tx_data_host/dev to nullptr so dispatcher uses IN_FLIGHT sentinel.
- Race between consumer clearing tx_flags and resetting slot_request:
producer could see slot available and write slot_request before the
consumer's slot_request=-1, permanently orphaning the slot. Fixed
by resetting slot_request before clearing tx_flags with a store
fence (__sync_synchronize) for ARM memory ordering.
Replace broken timing breakdown (dispatch_ts was always 0, making the
entire report show "Other/Misc Wait") with a 3-stage per-request
breakdown: [A] submit→worker poll, [B] worker task, [C] consumer
poll lag, with p50/p99 percentiles.
Also: reduce NUM_SLOTS 64→16 to cut queuing delay, remove unused
queue_depth from PipelineConfig, add DISABLE_PYMATCHING conditional
compilation, add stuck-request diagnostics, and remove batch mode /
watchdog / dead code.
Results (d7, 8 workers, open-loop):
62.5K req/s, 230 µs mean latency, 500K/500K completed, 0 drops.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Increase num_predecoders and num_workers from 8 to 16 across all config presets, and set NUM_SLOTS to 32. With 8 workers the pipeline capacity (~24K req/s) was below the 33K req/s arrival rate at 30 µs spacing, causing unbounded queuing and p99 latency spikes to 4.9 ms. With 16 workers and 32 slots, d13 at 30 µs arrival sustains 25K req/s with 299 µs mean latency (23 µs/round), p99 = 334 µs, and near-zero backpressure (9K stalls vs 38M previously). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add optimization profile in build_engine_from_onnx for ONNX models with dynamic dimensions (batch dim = 0). When detected, pin all dynamic dims to 1 via min/opt/max profile so TensorRT can build the engine. Previously these models failed with "Failed to build TRT engine from ONNX". Switch d13 config to predecoder_memory_d13_T13_X.onnx, which takes detectors as input rather than raw measurements. End-to-end latency drops from 299 µs to 226 µs, mainly from PyMatching (69 µs → 12 µs). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Replace byte-by-byte memory copies with wider load/store operations to reduce memory transactions in the CUDA graph. The input kernel now uses uint32_t (4-byte) copies, the passthrough kernel uses uint4 (16-byte) copies, and the output kernel is replaced entirely with cudaMemcpyAsync (DMA copy engine) followed by a minimal 1-thread signal kernel. Thread counts bumped from 128 to 256. Reduces d13 mean end-to-end latency from 226 µs to 141 µs (~85 µs) and per-round latency from 17.4 µs to 10.8 µs. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move the input copy from an SM-based kernel inside the CUDA graph to a host-issued cudaMemcpyAsync via a new pre_launch_fn callback on HostDispatchWorker. This frees GPU compute units for TRT inference and reduces Stage A latency by ~19 µs. Add get_trt_input_ptr() and get_host_ring_ptrs() accessors to support the callback wiring. Separate the T104 ONNX model into its own d13_r104 pipeline config (104 rounds, 32K slots) and restore d13_r13 to use the T13 model. Update design document to reflect DMA data movement, pre-launch callback, out-of-order consumer, and ARM memory ordering constraints. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Introduce a RealtimePipeline class (pipeline.h, realtime_pipeline.cu) that encapsulates all ring buffer allocation, atomic synchronization, dispatcher wiring, worker thread management, and consumer slot lifecycle behind a callback-driven API. Application code provides a GPU stage factory, a CPU stage callback, and a completion handler -- zero direct atomic access required. Refactor test_realtime_predecoder_w_pymatching.cpp from 1083 lines to ~470 lines by replacing inline atomics, thread management, and slot tracking with pipeline.submit() / pipeline.stop() / pipeline.stats(). Add d13_r104 config (T=104 model, 131K slot size). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
21 tests covering AIDecoderService, AIPreDecoderService, and the host-side dispatcher. Correctness tests push 5,000 random 6.4 KB payloads through the full CUDA graph pipeline and verify bitwise identity. Integration tests exercise multi-predecoder concurrency and sustained throughput (200 requests, regression for the 128-launch limit fix). SKIP_TRT buffer size increased to 1600 floats to match realistic syndrome payload sizes. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
The legacy predecoder_input_kernel and its cudaq::nvqlink includes are no longer used since input data arrives via the pre_launch DMA callback. Design doc updated to reflect current code: removed kernel deletion, RealtimePipeline scaffolding, test suite, and SKIP_TRT buffer size (1600 floats). Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…provements Add GPU-only pipeline mode that skips CPU worker threads when no cpu_stage callback is registered, using cudaLaunchHostFunc for completion signaling instead. Add post_launch_fn/post_launch_data callback to HostDispatchWorker and GpuWorkerResources, called after successful cudaGraphLaunch. Rename CpuStageContext fields to gpu_output/gpu_output_size and AIPreDecoderService buffers to h_predecoder_outputs_/d_predecoder_outputs_ for clarity. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Address all critical (C1-C4) and major (M1-M12) defects identified during code review.

Critical fixes:
- C1: Fix race condition in try_submit via compare_exchange_weak
- C2: Use uint64_t + a separate occupancy flag for slot_request to support the full request_id range (was int64_t with a -1 sentinel)
- C3: Add __syncthreads() before the response header write in gateway_output_kernel to prevent partially-written result reads
- C4: Always write the IN_FLIGHT sentinel to tx_flags after graph launch

Major fixes:
- M1: Remove cudaSetDeviceFlags from RingBufferManager (caller's duty)
- M2: Use std::atomic loads with memory_order_acquire for tx/rx flag reads instead of plain volatile (ARM correctness)
- M3: Validate num_workers <= 64 (idle_mask capacity)
- M4: Validate gpu_factory is set before start()
- M5: Check producer_stop in RingBufferInjector::submit to prevent an infinite spin after shutdown
- M6: Make the started flag std::atomic<bool>
- M7: Add CUDA error checks in AIDecoderService::capture_graph
- M8: Check the enqueueV3 return value in both service files
- M9: Fix tensor_volume for dynamic-shape dims (was wrapping to SIZE_MAX on dim=-1)
- M10: Assert num_workers == num_predecoders in the benchmark
- M11: Add aarch64 paths to the predecoder test's TRT CMake search
- M12: Replace vector<bool> with vector<uint8_t> to avoid concurrent-write UB

Also extracts the submit logic into a RingBufferInjector class to separate test infrastructure from the pipeline core.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Collaborator (Author)
The code in the top-level realtime directory is only temporary
Collaborator
Noting: the
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…exclusively

The realtime/ source tree is removed from the build. All CMake targets (cudaq-realtime-pipeline, test_realtime_pipeline, and test_realtime_predecoder_w_pymatching) now discover headers and libraries from the CUDAQ_REALTIME_ROOT install prefix via find_path/find_library.

- Remove add_subdirectory(realtime) from the top-level CMakeLists.txt
- Move pipeline.h to libs/qec/include/cudaq/qec/realtime/pipeline.h
- Move realtime_pipeline.cu to libs/qec/lib/realtime/
- Rewrite the cudaq-realtime-pipeline target to link against installed libs
- Remove all in-tree TARGET cudaq-realtime branches from the unittests CMake
- Migrate cudaq::nvqlink:: namespace references to cudaq::realtime::
- Update #include paths from cudaq/nvqlink/ to cudaq/realtime/
- Delete the entire realtime/ source tree (13.5k lines)

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…RPATH Link cudaq-realtime-host-dispatch directly to the predecoder test to resolve a missing symbol at runtime (RUNPATH is not transitive). Reorder BUILD_RPATH so the local build directory is searched before the install prefix, ensuring the freshly built pipeline library is loaded. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Update all cudaqx-side consumers to use the renamed C-compatible host dispatcher types (cudaq_host_dispatcher_config_t, cudaq_host_dispatch_worker_t, cudaq_host_dispatcher_loop) with opaque void* atomic fields and pointer+count worker arrays. Fix uninitialized post_launch_fn causing segfault in dispatcher tests. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Adapt RingBufferManager and RingBufferInjector to pass request_id and ptp_timestamp through to the updated cudaq_host_ringbuffer_write_rpc_request API (now writes the full 24-byte RPCHeader). Zero-initialize RPCHeader in test_realtime_pipeline to avoid uninitialized fields. Tune d13_r104 config to 16 slots / 4 workers based on benchmarking (0 backpressure stalls, p50=169µs, p99=186µs). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The TRT engine for the predecoder model uses uint8 I/O, but trt_dtype_size() was missing the kUINT8 case, falling through to the default of 4 bytes. This caused 4x oversized buffer allocations, garbled model input (int32 values read as uint8 by TRT), and misinterpreted output, producing coin-flip LER (~0.50). Three fixes bring the pipeline to a verified-correct LER of 0.002:

1. Add nvinfer1::DataType::kUINT8 to trt_dtype_size() (returns 1). This corrects buffer sizes, the input copy, and output interpretation.
2. Replace the CUDA-Q surface_code H_z parity matrix with the Stim-derived full spacetime check matrix (H) and observables matrix (O), loaded from binary files in --data-dir. This gives PyMatching the correct matching graph and lets it project edge corrections onto the logical observable.
3. Add --data-dir support for loading pre-generated Stim detector samples and ground-truth observables. The producer feeds real uint8 detector data through the pipeline, and a post-run correctness report compares decode results against ground truth.

Additional changes:
- Derive slot_size, residual_detectors, and spatial_slices from the TRT model bindings at runtime instead of hardcoding them in PipelineConfig
- Read request_id from RPCHeader before it is overwritten by RPCResponse
- Track per-request decode_corrections and logical_pred via request_id
- Pre-allocate syndrome tensors with thread_local to avoid per-decode heap allocation
- Bump the d13_r104 config to 8 workers / 32 slots for full-H decode latency headroom
- Guard the QEC_CPU_RELAX macro against redefinition from host_dispatcher.h
- Print the TRT binding dtype and element size in setup diagnostics

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Host-side AI predecoder pipeline with RealtimePipeline abstraction
Summary
Adds a complete host-side realtime decoding pipeline that pairs a GPU-based AI predecoder (TensorRT) with CPU-based PyMatching MWPM decoding, orchestrated by a new RealtimePipeline C++ abstraction that hides all low-level atomics and thread management from application code.

- AI services (AIPreDecoderService / AIDecoderService): TensorRT inference wrapped in CUDA graphs with gateway kernels for ring buffer I/O, supporting ONNX model loading, dynamic batch dims, FP16, and engine caching
- Callback-driven pipeline API (GpuStageFactory, CpuStageCallback, CompletionCallback), eliminating direct atomic access from application code
- compare_exchange_weak slot claiming and backpressure tracking
- Performance and correctness work: wide (uint4) loads, DMA-based output copy, out-of-order consumer harvesting, ARM memory ordering fixes (std::atomic acquire loads, __sync_synchronize fences), and removal of vector<bool> concurrent-write UB

Key files

- realtime/include/cudaq/realtime/pipeline.h, realtime/lib/pipeline/realtime_pipeline.cu
- realtime/include/.../host_dispatcher.h, realtime/lib/.../host_dispatcher.cu, host_dispatcher_capi.cu
- realtime/include/.../cudaq_realtime.h, realtime/lib/.../cudaq_realtime_api.cpp
- libs/qec/include/.../ai_decoder_service.h, ai_predecoder_service.h, libs/qec/lib/realtime/ai_decoder_service.cu, ai_predecoder_service.cu
- libs/qec/lib/realtime/test_realtime_predecoder_w_pymatching.cpp
- libs/qec/unittests/test_realtime_pipeline.cu, realtime/unittests/test_host_dispatcher.cu
- docs/host_side_dispatcher_design_gemini.md, docs/hybrid_ai_predecoder_pipeline.md

Test plan

- test_realtime_pipeline: 21 GTest cases pass (identity passthrough, multi-request correctness, shutdown, slot wraparound, sustained throughput)
- test_realtime_predecoder_w_pymatching: end-to-end benchmark runs with d7, d13_r13, d13_r104 configs at various injection rates
- test_host_dispatcher: host dispatcher unit tests pass
- test_dispatch_kernel: dispatch kernel unit tests pass
- SKIP_TRT=1 passthrough mode works for CI environments without GPU/TRT