Skip autoregressive generation during QNN LLM calibration #17786
abhinaykukkadapu wants to merge 1 commit into pytorch:main
Conversation
Fixes pytorch#17785 (Optimization 1)

During calibration, after prefilling the prompt tokens into the KV cache, the pipeline previously ran `_generate`, an autoregressive loop producing hundreds of tokens until EOS. This generation is unnecessary because the quantization observers already have sufficient activation statistics from the prefill pass.

For qwen2_5-1_5b (hybrid, max_seq_len=1024), this eliminates ~188 min of wasted computation (from 189 min down to <1 min for prompt calibration), reducing total end-to-end compilation time from ~8.1 h to ~3.9 h.

The `skip_generate` flag is threaded through `graph_module_inference` → `GraphModuleCalibrationWrapper._model_call` → `kv_inference`, which gates the `_generate()` call. Normal (non-calibration) inference is unaffected since the default is `skip_generate=False`.

Note: See the GitHub issue #17785 for profiling results.

This PR was co-authored with Claude.
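The gating described above can be sketched as follows. This is a minimal stand-in, not the actual ExecuTorch code: the function names `kv_inference` and `_generate` are taken from the PR description, but their bodies here are placeholders that only illustrate the control flow.

```python
def _generate(tokens, max_seq_len):
    # Stand-in for the autoregressive decode loop (the expensive part):
    # in the real pipeline each step runs the model to produce the next
    # token; here we just pad with a placeholder token id.
    while len(tokens) < max_seq_len:
        tokens = tokens + [0]
    return tokens


def kv_inference(prompt_tokens, max_seq_len, skip_generate=False):
    # Prefill: run the prompt through the model once, populating the KV
    # cache (and, during calibration, the quantization observers).
    tokens = list(prompt_tokens)  # stand-in for the prefill pass
    if skip_generate:
        # Calibration path: prefill statistics are sufficient, stop here.
        return tokens
    # Normal inference path: continue with autoregressive decoding.
    return _generate(tokens, max_seq_len)
```

During calibration the wrapper would call `kv_inference(..., skip_generate=True)`; the default of `False` keeps ordinary inference unchanged.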
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17786
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Awaiting Approval, 3 New Failures as of commit f925ac4 with merge base 25f2a3f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thinking about it more, I feel the intention should ideally be to pass in the actual prompt + output, instead of letting it generate to a certain length. Then the model sees the same amount of data anyway, without generating it during calibration (which can be expensive).
I think
Thanks @cccclai and @haowhsu-quic for the review.
@cccclai, do you mean we should replicate the prompt to max length to avoid triggering `_generate` during prompt calibration? @haowhsu-quic Good point about special tokens. The current
I think we will apply the chat template on user prompts, like wrapping them with <think> for qwen3. Maybe having a quality dataset where model-specific tokens are well covered in the calibration process might be better (in line with Chen's thought, if I understand correctly).
I think the prompt is provided by users, and after applying the chat template, the special tokens might show up (wouldn't they show up in wikitext too? did we only apply the chat template to the prompt but not the task?). My suggestion is mostly to have a prompt with generated output that covers the tokens, so during calibration we see both prompt + generated output at once and don't need to generate them while calibrating. It should at least improve the prefill calibration a lot.
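The suggestion above (calibrate on prompt + pre-generated output in a single prefill) can be sketched as below. This is a hypothetical helper, not part of the PR; it assumes the calibration dataset is available as pairs of already-tokenized prompts and their known (golden or pre-generated) outputs.

```python
def build_calibration_inputs(prompt_output_pairs):
    """Concatenate each prompt with its known output token ids so a single
    prefill pass exposes the quantization observers to both, with no
    autoregressive decoding during calibration.

    prompt_output_pairs: list of (prompt_token_ids, output_token_ids).
    Returns a list of combined token-id sequences to feed to prefill.
    """
    return [list(prompt) + list(output) for prompt, output in prompt_output_pairs]
```

Each combined sequence would then go through the prefill-only calibration path (e.g. with `skip_generate=True`), so the observers see output-token activations without running the decode loop.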