Skip autoregressive generation during QNN LLM calibration #17786
abhinaykukkadapu wants to merge 1 commit into pytorch:main
Conversation
Fixes pytorch#17785 (Optimization 1)

During calibration, after prefilling the prompt tokens into the KV cache, the pipeline previously ran `_generate`, an autoregressive loop producing hundreds of tokens until EOS. This generation is unnecessary because the quantization observers already have sufficient activation statistics from the prefill pass.

For qwen2_5-1_5b (hybrid, max_seq_len=1024), this eliminates ~188 min of wasted computation (from 189 min down to <1 min for prompt calibration), reducing total end-to-end compilation time from ~8.1 h to ~3.9 h.

The `skip_generate` flag is threaded through `graph_module_inference` → `GraphModuleCalibrationWrapper._model_call` → `kv_inference`, which gates the `_generate()` call. Normal (non-calibration) inference is unaffected since the default is `skip_generate=False`.

Note: See the GitHub issue #17785 for profiling results.

This PR was co-authored with Claude.
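The gating described above can be sketched as follows. This is a minimal stand-in, not the actual ExecuTorch code: the function names `kv_inference` and `_generate` are taken from the PR description, but their bodies here are placeholders that only illustrate the control flow.

```python
def _generate(tokens, max_seq_len):
    # Stand-in for the autoregressive decode loop (the expensive part):
    # in the real pipeline each step runs the model to produce the next
    # token; here we just pad with a placeholder token id.
    while len(tokens) < max_seq_len:
        tokens = tokens + [0]
    return tokens


def kv_inference(prompt_tokens, max_seq_len, skip_generate=False):
    # Prefill: run the prompt through the model once, populating the KV
    # cache (and, during calibration, the quantization observers).
    tokens = list(prompt_tokens)  # stand-in for the prefill pass
    if skip_generate:
        # Calibration path: prefill statistics are sufficient, stop here.
        return tokens
    # Normal inference path: continue with autoregressive decoding.
    return _generate(tokens, max_seq_len)
```

During calibration the wrapper would call `kv_inference(..., skip_generate=True)`; the default of `False` keeps ordinary inference unchanged.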
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17786
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Awaiting Approval, 3 New Failures as of commit f925ac4 with merge base 25f2a3f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thinking about it more, I feel the intention should ideally be to pass in the actual prompt + output, instead of letting it generate to a certain length. Then the model sees the same amount of data anyway, without generating it during calibration (which can be expensive).
I think
Thanks @cccclai and @haowhsu-quic for the review.
@cccclai, do you mean we should replicate the prompt to max length to avoid triggering `_generate` during prompt calibration? @haowhsu-quic Good point about special tokens. The current
I think we will apply the chat template on user prompts, like wrapping them with <think> for qwen3. Maybe having a quality dataset where model-specific tokens are well covered in the calibration process might be better (in line with Chen's thought, if I understand correctly).
I think the prompt is provided by users, and after applying the chat template, the special tokens might show up (wouldn't they show up in wikitext too? did we only apply the chat template to the prompt but not the task?). My suggestion is mostly to have a prompt with generated output that covers the tokens, so during calibration we see both prompt + generated output at once and don't need to generate them while calibrating. It should at least improve the prefill calibration a lot.
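The suggestion above (calibrate on prompt + pre-generated output in a single prefill) can be sketched as below. This is a hypothetical helper, not part of the PR; it assumes the calibration dataset is available as pairs of already-tokenized prompts and their known (golden or pre-generated) outputs.

```python
def build_calibration_inputs(prompt_output_pairs):
    """Concatenate each prompt with its known output token ids so a single
    prefill pass exposes the quantization observers to both, with no
    autoregressive decoding during calibration.

    prompt_output_pairs: list of (prompt_token_ids, output_token_ids).
    Returns a list of combined token-id sequences to feed to prefill.
    """
    return [list(prompt) + list(output) for prompt, output in prompt_output_pairs]
```

Each combined sequence would then go through the prefill-only calibration path (e.g. with `skip_generate=True`), so the observers see output-token activations without running the decode loop.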