Add CUDA argmax kernel for LLM sampler #16386

larryliu0820 · 2025-12-24T01:46:55Z

Add a CUDA kernel for the argmax operation to support GPU-based sampling:

argmax.cuh: Template kernel using warp-level reductions with __shfl_xor_sync
for efficient parallel max finding. Supports float, half, and bfloat16.
argmax.cu: Wrapper function argmax_cuda() that launches the kernel,
handles device-to-host copy, and synchronization.
test_argmax.cu: Comprehensive unit tests covering various vocab sizes,
data types, edge cases, and numerical precision.
CMakeLists.txt: Build configuration for extension_llm_sampler_cuda library
and GoogleTest-based unit tests.
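
For readers unfamiliar with the reduction pattern named above, here is a minimal host-side C++ sketch of an XOR "butterfly" argmax. It is not the code in argmax.cuh: it simulates on the CPU the lane exchange that `__shfl_xor_sync` performs inside a warp, and the names `MaxIdx`, `combine`, and `argmax_butterfly` are hypothetical. Breaking ties toward the smaller index and returning -1 for empty input are assumptions made for this illustration, not behavior confirmed by the PR.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

constexpr int kWarpSize = 32;  // number of simulated lanes (a CUDA warp has 32 threads)

// Hypothetical (value, index) pair carried by each simulated lane.
struct MaxIdx {
    float value;
    int index;
};

// Keep the larger value; on a tie keep the smaller index (an assumption,
// chosen so the result does not depend on the reduction order).
inline MaxIdx combine(const MaxIdx& a, const MaxIdx& b) {
    if (b.value > a.value || (b.value == a.value && b.index < a.index)) {
        return b;
    }
    return a;
}

// CPU simulation of a warp-level argmax. Returns -1 if no comparable
// element is found (empty input, or input made only of NaNs).
inline int argmax_butterfly(const std::vector<float>& logits) {
    const int kNoIndex = std::numeric_limits<int>::max();
    std::array<MaxIdx, kWarpSize> lanes;

    // Step 1: each lane scans a strided slice of the input and keeps its local best.
    for (int lane = 0; lane < kWarpSize; ++lane) {
        MaxIdx best{-std::numeric_limits<float>::infinity(), kNoIndex};
        for (std::size_t i = static_cast<std::size_t>(lane); i < logits.size();
             i += kWarpSize) {
            best = combine(best, MaxIdx{logits[i], static_cast<int>(i)});
        }
        lanes[lane] = best;
    }

    // Step 2: butterfly exchange. In each round, lane L combines its pair with
    // the pair held by lane (L ^ offset), which is the partner that
    // __shfl_xor_sync would read from on the device.
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const std::array<MaxIdx, kWarpSize> prev = lanes;  // snapshot of the previous round
        for (int lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] = combine(prev[lane], prev[lane ^ offset]);
        }
    }

    // After log2(32) = 5 rounds every lane holds the same global result.
    return lanes[0].index == kNoIndex ? -1 : lanes[0].index;
}
```

Calling `argmax_butterfly` on a vector of logits returns the position of the largest entry. On a GPU the snapshot copy is unnecessary because all lanes of a warp exchange registers in a single shuffle instruction, and a kernel covering a full vocabulary would typically also need a second stage to combine the per-warp results.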

[ghstack-poisoned]

larryliu0820 · 2025-12-24T01:46:56Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-12-24T01:46:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16386

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit f8cd4d2 with merge base c5d66a5:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for extension/llm/sampler/test/test_argmax.cu:
pull / test-samsung-models-linux / linux-job (gh)
test_inception_v3_fp16

FLAKY - The following job failed but was likely due to flakiness present on trunk:

Test Metal Backend / export-model-metal-artifact (openai, whisper-small, non-quantized) / macos-job (gh) (matched macos rule in flaky-rules.json)
File doesn't exist

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / android / run-emulator (gh) (#16137)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Update

f8cd4d2

[ghstack-poisoned]

larryliu0820 requested review from kirklandsign and mergennachin as code owners December 24, 2025 01:46

meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Dec 24, 2025

This was referenced Dec 24, 2025

Add CudaSampler class for GPU-based token sampling #16387

Open

Integrate CUDA sampler into ASR runner and enable skip_copy for decoder #16388

Open

larryliu0820 added the release notes: desktop for desktop/laptop workstream label Dec 24, 2025
