Qualcomm AI Engine Direct - remove prefill calibration #17805
haowhsu-quic wants to merge 1 commit into pytorch:main
Conversation
- calibrate kv text decoder only to reduce calibration time
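The change described above (calibrating only the kv/decode graph and reusing its activation encodings for the prefill graph) could look roughly like the sketch below. This is an illustration only, not the actual ExecuTorch code: `FakeNode`, `activation_override`, `copy_encodings`, the `quant_attrs` meta key, and the one-to-one node pairing are all assumptions made for the example.

```python
# Hypothetical sketch: after calibrating only the kv/decode graph, copy its
# observed activation encodings onto the structurally identical prefill
# nodes so the prefill graph needs no calibration pass of its own.

class FakeNode:
    """Stand-in for a torch.fx Node, just enough for this sketch."""
    def __init__(self, target=None, users=(), meta=None):
        self.target = target
        self.users = dict.fromkeys(users)  # fx stores users as a dict
        self.meta = meta if meta is not None else {}

def activation_override(decode_node, prefill_node):
    # Reuse decode's observed quant params (scale / zero_point) for prefill.
    prefill_node.meta["quant_attrs"] = decode_node.meta["quant_attrs"]

def copy_encodings(decode_nodes, prefill_nodes, ptq_target):
    # Pair up matching nodes and copy encodings where the consumer op
    # is one of the PTQ-quantized targets.
    for decode_node, prefill_node in zip(decode_nodes, prefill_nodes):
        if list(decode_node.users)[0].target in ptq_target:
            activation_override(decode_node, prefill_node)

# Demo: one matmul consumer whose producer's encoding gets copied over.
user = FakeNode(target="aten.matmul")
d = FakeNode(users=[user], meta={"quant_attrs": {"scale": 0.02, "zero_point": 0}})
p = FakeNode(users=[user])
copy_encodings([d], [p], ptq_target={"aten.matmul"})
print(p.meta["quant_attrs"])  # {'scale': 0.02, 'zero_point': 0}
```

The design intuition is that since the prefill and decode graphs share the same weights and structure, a single calibration pass over decode can supply encodings for both, cutting total quantization time.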
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17805
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 311249c with merge base 0c2ff55.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Thanks a lot! In addition to this, I noticed that SeqMSE grid searching is done sequentially instead of in parallel; is there room to improve there?
Yes, will look into it.
Can you share what's broken? Is it the flow to consume torchtune in the executorch repo?
```diff
  # transpose first to decrease the runtime efforts
  k_cache.append(
-     torch.zeros(
+     torch.ones(
```
Are we initializing the kv cache with different values?
```python
if list(decode_node.users)[0].target in ptq_target:
    activation_override(decode_node, prefill_node)
```
```python
# copy encoding for hybrid mode
```
Are you copying over the quantization parameters from kv mode to prefill mode?
```python
# ...
# however, pytorch will use different computation kernels for different
# workloads (AR1 vs ARN) which will introduce some numerical discrepancy.
```
What is the mechanism to make sure the encodings align correctly?
I'm worried about the accuracy too if we get rid of prefill calibration. Do you think that generating prompt + output using the fp32 model (pre-observers), as discussed in PR #17786, and running prefill + decode as before with skip_generate might yield better accuracy than getting rid of prefill calibration entirely?
Prefill calibration ideally is not needed, because decode sees all the generated tokens too, and the prefill graph and decode graph should be the same. I remember @haowhsu-quic mentioned we insert the kv cache output of prefill and connect it to the decode input to make sure those quant nodes are also calibrated. I did a comparison of quant params between prefill and decode in the past and they were very, very close. I'm trying to figure out whether this PR handles the kv cache differently than before.
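The claim that decode-side calibration can cover prefill can be illustrated with a toy min/max observer. This is a sketch of the intuition only, not the actual quantizer: a min/max-style observer over the same token activations yields the same scale whether tokens arrive one at a time (AR1 / decode) or all at once (ARN / prefill); in practice the different compute kernels introduce the small numerical discrepancy mentioned above, so the params come out very close rather than identical. `minmax_scale` and the sample values are made up for the example.

```python
# Toy illustration (assumption, not the real observer): the same
# activations produce the same per-tensor scale under AR1 and ARN.

def minmax_scale(values, qmax=255):
    # Asymmetric 8-bit per-tensor scale from observed min/max.
    lo, hi = min(values), max(values)
    return (hi - lo) / qmax

token_acts = [[-0.8, 0.1], [0.5, 1.2], [-0.3, 0.9]]  # 3 decode steps

# Decode (AR1): observe step by step, keeping a running min/max.
running = []
for step in token_acts:
    running.extend(step)
decode_scale = minmax_scale(running)

# Prefill (ARN): observe the whole prompt in one shot.
prefill_scale = minmax_scale([v for step in token_acts for v in step])

assert abs(decode_scale - prefill_scale) < 1e-9
print(decode_scale)  # 2.0 / 255 ≈ 0.00784
```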
I guess my question is that prefill "sees" previous tokens, and the attention block will take these into consideration while generating the kv cache.
For weights, what you said makes sense, as they are idempotent from the math standpoint; maybe we should just check the PPL on-device?
CC: @metascroy, @kimishpatel if you have any thoughts on this.
Summary
Total Quantization Time
Test plan
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript / TestExampleMultimodalityScript