Voxtral Realtime: enable CUDA backend with int4 quantization #17798

mergennachin wants to merge 1 commit into `main`.
Add CUDA/AOTI backend support for the Voxtral Realtime model alongside the existing XNNPACK and Metal backends.

Model (model.py):
- CudaSDPA: F.scaled_dot_product_attention with repeat_interleave for GQA expansion and boolean attention masks (a Triton SDPA requirement)
- StaticKVCache (shared with Metal) for the [B,H,S,D] layout with index_copy_
- StandardEncoderRingKVCache/StandardEncoderSDPA for the streaming encoder
- _build_causal_mask_bool: 4D boolean mask for Triton compatibility
- Simplified LMAttention.forward to always pass attn_mask (None for XNNPACK)

Export (export_voxtral_rt.py):
- --backend cuda with CudaPartitioner and conv1d_to_conv2d decomposition
- --dtype flag (default fp32; bf16 for CUDA Triton SDPA)
- --qlinear-packing-format / --qlinear-encoder-packing-format for tile_packed_to_4d int4 quantization
- CUDA device placement, Dim.AUTO for the audio encoder, .ptd output

Runner (main.cpp, voxtral_realtime_runner.cpp/.h):
- --data_path flag for .ptd delegate data (CUDA compiled kernels)
- Module two-arg constructor for pte+ptd loading

Build (CMakePresets.json, Makefile):
- voxtral-realtime-cuda preset
- make voxtral_realtime-cuda target

CI (.github/workflows/cuda.yml, .ci/scripts/):
- Voxtral Realtime in the CUDA CI matrix (int4-tile-packed, offline mode)
- Export/test scripts updated for CUDA quantization args and data path
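The CudaSDPA changes above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the function names (`build_causal_mask_bool`, `gqa_sdpa`) and shapes are assumptions; it only shows the pattern of expanding grouped KV heads with `repeat_interleave` and passing a 4D boolean mask to `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F

def build_causal_mask_bool(seq_len: int) -> torch.Tensor:
    # 4D boolean mask [1, 1, S, S]; True = attend, False = masked out.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return mask.view(1, 1, seq_len, seq_len)

def gqa_sdpa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
    # q: [B, Hq, S, D]; k, v: [B, Hkv, S, D], with Hq a multiple of Hkv (GQA).
    n_rep = q.shape[1] // k.shape[1]
    # Materialize the KV-head expansion so the head counts match before SDPA.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = torch.randn(1, 8, 4, 16)   # 8 query heads
k = torch.randn(1, 2, 4, 16)   # 2 KV heads
v = torch.randn(1, 2, 4, 16)
out = gqa_sdpa(q, k, v, build_causal_mask_bool(4))
```

Passing a boolean mask (rather than an additive float mask) matches the Triton SDPA requirement mentioned above.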
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17798
Note: links to docs will display an error until the docs builds have completed.

❌ 7 New Failures, 1 Unrelated Failure as of commit 1e5399a with merge base 25f2a3f.

NEW FAILURES: jobs that failed and were not present on the merge base. BROKEN TRUNK: a job that failed but was already failing on the merge base; 👉 rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
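The static KV-cache update described in the model changes can be sketched as below. This is a minimal sketch under assumptions: the class name, constructor, and `update` signature are illustrative, not the PR's actual `StaticKVCache` API; it only demonstrates in-place writes into a fixed [B,H,S,D] buffer with `index_copy_`, which keeps tensor shapes static for ahead-of-time compilation.

```python
import torch

class StaticKVCache:
    """Fixed-size KV cache in [B, H, S, D] layout, updated in place."""

    def __init__(self, batch: int, heads: int, max_seq: int, head_dim: int):
        self.k = torch.zeros(batch, heads, max_seq, head_dim)
        self.v = torch.zeros(batch, heads, max_seq, head_dim)

    def update(self, pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        # pos: 1D tensor of sequence positions to write.
        # k_new/v_new: [B, H, len(pos), D].
        # index_copy_ writes along dim 2 (the S axis) without resizing,
        # so the cache tensors keep a static shape across decode steps.
        self.k.index_copy_(2, pos, k_new)
        self.v.index_copy_(2, pos, v_new)
        return self.k, self.v

cache = StaticKVCache(batch=1, heads=2, max_seq=8, head_dim=4)
k_step = torch.ones(1, 2, 1, 4)
v_step = torch.full((1, 2, 1, 4), 2.0)
k, v = cache.update(torch.tensor([3]), k_step, v_step)
```

Here position 3 of the cache receives the new step while all other positions stay untouched, which is the behavior a static-shape decode loop relies on.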