Skip autoregressive generation during QNN LLM calibration#17786

Open
abhinaykukkadapu wants to merge 1 commit into pytorch:main from abhinaykukkadapu:skip-generate-calibration

Conversation

@abhinaykukkadapu (Contributor) commented Mar 2, 2026

Fixes #17785 (Optimization 1)

During calibration, after prefilling the prompt tokens into the KV cache, the pipeline previously ran _generate — an autoregressive loop producing hundreds of tokens until EOS. This generation is unnecessary because the quantization observers already have sufficient activation statistics from the prefill pass.

For qwen2_5-1_5b (hybrid, max_seq_len=1024), this eliminates ~188 min of wasted computation (from 189 min down to <1 min for prompt calibration), reducing total end-to-end compilation time from ~8.1h to ~3.9h.

The `skip_generate` flag is threaded through `graph_module_inference` → `GraphModuleCalibrationWrapper._model_call` → `kv_inference`, which gates the `_generate()` call. Normal (non-calibration) inference is unaffected since the default is `skip_generate=False`.

Note: see GitHub issue #17785 for profiling results.

This PR was co-authored with Claude.
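The gating described above can be sketched as follows. This is a simplified illustration, not the actual ExecuTorch implementation: the function names (`kv_inference`, `_generate`, `skip_generate`) come from the PR, but the signatures, bodies, and the toy model are assumptions made for the sake of a runnable example.

```python
# Minimal sketch of gating _generate behind a skip_generate flag.
# The real kv_inference/_generate live in ExecuTorch's QNN LLM calibration
# pipeline; everything here is a simplified stand-in.

def _generate(model, tokens, max_seq_len):
    # Autoregressive decode loop (stand-in): append tokens until max length.
    while len(tokens) < max_seq_len:
        tokens = tokens + [model(tokens)]
    return tokens

def kv_inference(model, prompt_tokens, max_seq_len, skip_generate=False):
    # Prefill: run the prompt through the model so quantization observers
    # record activation statistics for every prompt position.
    _ = [model(prompt_tokens[: i + 1]) for i in range(len(prompt_tokens))]
    # During calibration the prefill pass already gave the observers enough
    # statistics, so the expensive decode loop can be skipped entirely.
    if skip_generate:
        return prompt_tokens
    return _generate(model, prompt_tokens, max_seq_len)

# Toy "model": the next token is just the current length (keeps this runnable).
toy_model = lambda toks: len(toks)

calib_out = kv_inference(toy_model, [1, 2, 3], max_seq_len=8, skip_generate=True)
full_out = kv_inference(toy_model, [1, 2, 3], max_seq_len=8, skip_generate=False)
# Calibration stops at the prompt; normal inference decodes to max_seq_len.
print(len(calib_out), len(full_out))
```

Because the flag defaults to `False` and is only set during calibration, the decode path for normal inference is untouched.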

@pytorch-bot commented Mar 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17786

Note: links to docs will display an error until the docs builds have completed.

❌ 1 Awaiting Approval, 3 New Failures

As of commit f925ac4 with merge base 25f2a3f:

AWAITING APPROVAL: the following workflow needs approval before CI can run.

NEW FAILURES: the following jobs have failed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Mar 2, 2026
@github-actions bot commented Mar 2, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@cccclai (Contributor) commented Mar 3, 2026

Thinking about it more, I feel the intention ideally should be to pass in the actual prompt + output, instead of letting the model generate to a certain length. The model would then see the same amount of data anyway, without producing it token by token during calibration (which can be expensive).

@haowhsu-quic (Collaborator) commented:

I think _generate might still be necessary for some special tokens, like reasoning tokens, which may not appear in the tasks.
Thanks for the thoughts from Chen; we will figure out some possible reworks for the calibration process.

@abhinaykukkadapu (Contributor, Author) commented:
Thanks @cccclai and @haowhsu-quic for the review.

> intention ideally should be passed in the actual prompt + output

@cccclai do you mean we should replicate the prompt to max length to avoid triggering _generate during prompt calibration?

@haowhsu-quic Good point about special tokens. The current _generate loop doesn't guarantee those tokens appear either (it depends on the prompt). Would a more reliable approach be to include calibration prompts that explicitly contain the special tokens we care about?

@haowhsu-quic (Collaborator) commented Mar 3, 2026

I think we will apply a chat template on user prompts, e.g. wrapping them with <think> for qwen3. Having a quality dataset where model-specific tokens are well covered in the calibration process might be better (in line with Chen's thought, if I understand correctly).

@cccclai (Contributor) commented Mar 3, 2026

I think the prompt is provided by users, and after applying the chat template the special tokens might show up (wouldn't they show up in wikitext too? Did we only apply the chat template to the prompt but not the task?). My suggestion is mostly to pair each prompt with a pre-generated output that covers the tokens, so during calibration we see both prompt + generated output at once and don't need to generate them while calibrating. It should at least improve the prefill calibration a lot.
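The suggestion above could look something like the following sketch: completions are generated (or curated) offline once, and calibration then runs a single prefill pass over prompt + completion with no decode loop. Every name here (`apply_chat_template`, `build_calibration_sample`, `calibrate`, the token values) is a hypothetical placeholder, not ExecuTorch API.

```python
# Hypothetical sketch: calibrate on prompt + pre-generated completion in one
# prefill pass, so no autoregressive generation happens inside the
# calibration loop. All names and token values are illustrative.

def apply_chat_template(prompt_tokens, bot=0, eot=99):
    # Stand-in for a real chat template: wraps the prompt in special tokens
    # so those tokens are also seen by the quantization observers.
    return [bot] + prompt_tokens + [eot]

def build_calibration_sample(prompt_tokens, completion_tokens, max_seq_len):
    # Concatenate the templated prompt and a pre-generated completion,
    # truncated to the model's maximum sequence length.
    sample = apply_chat_template(prompt_tokens) + completion_tokens
    return sample[:max_seq_len]

def calibrate(model, samples):
    # One prefill pass per sample: every position (prompt and completion)
    # feeds the observers, with no decode loop.
    for sample in samples:
        for i in range(len(sample)):
            model(sample[: i + 1])

sample = build_calibration_sample([5, 6, 7], [8, 9], max_seq_len=16)
print(sample)  # [0, 5, 6, 7, 99, 8, 9]
```

The trade-off is preparing completions up front, but the observers then see decode-position statistics without paying for per-token generation during calibration.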


Labels

CLA Signed


Development

Successfully merging this pull request may close these issues.

Skip calibrating with generated tokens in the calibration loop.

3 participants