
Qualcomm AI Engine Direct - remove prefill calibration#17805

Open
haowhsu-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev_kv_calibration_only

Conversation

@haowhsu-quic
Collaborator

Summary

  • calibrate the kv text decoder only to reduce calibration time
  • deprecate the outdated implementation & use deterministic example inputs for LLMs
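The second bullet can be illustrated with a small sketch. Everything below is hypothetical (not the PR's code) and only shows the idea of seeding example inputs so calibration runs are reproducible:

```python
import random

# Hypothetical sketch: derive example token ids from a fixed seed so
# repeated export/calibration runs see identical inputs.
def make_example_tokens(vocab_size, seq_len, seed=0):
    rng = random.Random(seed)  # fixed seed -> deterministic sequence
    return [rng.randrange(vocab_size) for _ in range(seq_len)]

a = make_example_tokens(32000, 8)
b = make_example_tokens(32000, 8)
print(a == b)  # True: the example inputs are identical across runs
```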

Total Quantization Time

| Model | Before (s) | After (s) | Improvement |
| --- | --- | --- | --- |
| gemma-2b | 2203.399 | 999.512 | 54.64% |
| gemma2-2b | 2177.285 | 1001.248 | 54.01% |
| gemma3-1b | 1776.861 | 548.312 | 69.14% |
| glm-1_5b | 1434.780 | 677.257 | 52.8% |
| granite_3_3-2b | 59566.790 | 6165.443 | 89.65% |
| llama3_2-1b | 4528.620 | 2953.233 | 34.79% |
| llama3_2-3b | 5744.429 | 1652.157 | 71.24% |
| phi_4_mini | 7005.601 | 2071.634 | 84.56% |
| qwen2_5-0_5b | 480.508 | 372.076 | 22.57% |
| qwen2_5-1_5b | 2064.333 | 899.164 | 56.44% |
| qwen3-0_6b | 1673.150 | 1124.149 | 32.81% |
| qwen3-1_7b | 3253.723 | 1148.511 | 64.7% |
| smollm2_135m | 502.779 | 414.510 | 17.56% |
| smollm3-3b | 4663.057 | 1613.516 | 65.4% |
| smolvlm_500m_instruct | 288.246 | 170.829 | 40.73% |
| internvl3_1b | 256.624 | 170.811 | 33.44% |
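The Improvement column appears to be the relative reduction, (before - after) / before; a quick sanity check against the gemma-2b row:

```python
# Sanity check of the Improvement column, assuming it is the relative
# reduction (before - after) / before, using the gemma-2b row above.
before, after = 2203.399, 999.512
improvement = (before - after) / before * 100
print(f"{improvement:.2f}%")  # 54.64%
```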

Test plan

python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleMultimodalityScript

- calibrate kv text decoder only to reduce calibration time
@haowhsu-quic haowhsu-quic requested a review from cccclai as a code owner March 3, 2026 06:05
@pytorch-bot

pytorch-bot bot commented Mar 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17805

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit 311249c with merge base 0c2ff55:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 3, 2026
@haowhsu-quic
Collaborator Author

haowhsu-quic commented Mar 3, 2026

Hi @cccclai, this PR reduces the calibration time of LLMs. It might mitigate #17784 a bit; I will keep working on other possible optimizations.
I also noticed that #17718 is breaking the torchtune package, could you help update the torchtune repo? Thank you!

@haowhsu-quic
Collaborator Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot bot added the release notes: qualcomm Changes to the Qualcomm backend delegate label Mar 3, 2026
@cccclai
Contributor

cccclai commented Mar 3, 2026

Thanks a lot! In addition to this, I noticed that SeqMSE grid searching is done sequentially instead of in parallel, is there room to improve there?

@haowhsu-quic
Collaborator Author

> Thanks a lot! In addition to this, I noticed that SeqMSE grid searching is done sequentially instead of in parallel, is there room to improve there?

Yes, will look into it.

@cccclai
Contributor

cccclai commented Mar 4, 2026

> I also noticed that #17718 is breaking the torchtune package, could you help update the torchtune repo?

Can you share what's broken? Is it the flow to consume torchtune in the executorch repo?

```diff
 # transpose first to decrease the runtime efforts
 k_cache.append(
-    torch.zeros(
+    torch.ones(
```
Contributor

Are we initializing the kv cache with different values?

```python
if list(decode_node.users)[0].target in ptq_target:
    activation_override(decode_node, prefill_node)

# copy encoding for hybrid mode
```
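The snippet above suggests that encodings observed on the decode (kv) graph are copied onto the matching prefill nodes. A hypothetical sketch of that idea (the names and data layout here are illustrative, not the PR's actual implementation):

```python
# Hypothetical sketch: reuse the decode graph's observed (scale, zero_point)
# for the matching prefill activations so both graphs share one encoding.
def activation_override_sketch(decode_encodings, prefill_encodings):
    for name, params in decode_encodings.items():
        if name in prefill_encodings:
            prefill_encodings[name] = params
    return prefill_encodings

prefill = activation_override_sketch(
    {"attn_out": (0.02, 128)}, {"attn_out": None}
)
print(prefill["attn_out"])  # (0.02, 128)
```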
Contributor

Are you copying over the quantization parameters from kv mode to prefill mode?

```python
#
# however, pytorch will use different computation kernels for different
# workloads (AR1 vs ARN) which will introduce some numerical discrepancy.
#
```
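The comment above refers to different kernels accumulating floating-point values in different orders; since float addition is not associative, results can differ slightly. A pure-Python illustration of that effect (an assumption about the cause, not executorch code):

```python
# Summation order changes the result because float addition is not
# associative; different kernels (AR1 vs ARN) can hit this.
vals = [1e16, 1.0, -1e16]
left_to_right = (vals[0] + vals[1]) + vals[2]  # 1.0 is absorbed: 0.0
reordered = (vals[0] + vals[2]) + vals[1]      # cancellation first: 1.0
print(left_to_right, reordered)  # 0.0 1.0
```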
Contributor

What is the mechanism to make sure the encodings align correctly?

Contributor

I'm worried about the accuracy too if we get rid of prefill calibration. Do you think that generating prompt + output using the fp32 model (pre-observers), as discussed in PR #17786, and running prefill + decode as before with skip_generate might yield better accuracy than getting rid of prefill calibration entirely?

Contributor

@cccclai cccclai Mar 4, 2026

Prefill calibration ideally is not needed because decode sees all the generated tokens too, and the prefill graph and decode graph should be the same. I remember @haowhsu-quic mentioned we insert the kv cache output of prefill and connect it to the kv cache input of decode to make sure those quant nodes are also calibrated. I did a comparison of quant params between prefill and decode in the past and they were very close. I'm trying to figure out whether this PR handles the kv cache differently than before.
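A toy illustration of why decode-only calibration can cover the same activation ranges: with a simple min/max observer (a simplification, not the actual quantizer), the observed range is the same whether activations arrive all at once or one token per step.

```python
# A min/max observer sees the same overall range whether activations
# arrive all at once (prefill) or one token per step (decode).
acts = [0.1, -0.7, 2.3, 0.4, -1.2]

def observe(chunks):
    lo = min(min(c) for c in chunks)
    hi = max(max(c) for c in chunks)
    return lo, hi

prefill_range = observe([acts])              # whole prompt at once
decode_range = observe([[a] for a in acts])  # one token per step
print(prefill_range == decode_range)  # True
```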

Contributor

I guess my question is that prefill "sees" previous tokens, and the attention block will take these into consideration while generating the kv-cache.

For weights, what you said makes sense as they are idempotent from the math standpoint; maybe we should just check the PPL on-device?

CC: @metascroy, @kimishpatel if you have any thoughts on this.


Labels

  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • release notes: qualcomm (Changes to the Qualcomm backend delegate)


Development

Successfully merging this pull request may close these issues.

Remove prefill calibration

3 participants