Qualcomm AI Engine Direct - remove prefill calibration #17805
haowhsu-quic wants to merge 1 commit into pytorch:main
Conversation
- calibrate kv text decoder only to reduce calibration time
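The change described above (calibrating only the kv/decode graph and reusing its activation encodings for the prefill graph) could look roughly like the sketch below. This is an illustration only, not the actual ExecuTorch code: `FakeNode`, `activation_override`, `copy_encodings`, the `quant_attrs` meta key, and the one-to-one node pairing are all assumptions made for the example.

```python
# Hypothetical sketch: after calibrating only the kv/decode graph, copy its
# observed activation encodings onto the structurally identical prefill
# nodes so the prefill graph needs no calibration pass of its own.

class FakeNode:
    """Stand-in for a torch.fx Node, just enough for this sketch."""
    def __init__(self, target=None, users=(), meta=None):
        self.target = target
        self.users = dict.fromkeys(users)  # fx stores users as a dict
        self.meta = meta if meta is not None else {}

def activation_override(decode_node, prefill_node):
    # Reuse decode's observed quant params (scale / zero_point) for prefill.
    prefill_node.meta["quant_attrs"] = decode_node.meta["quant_attrs"]

def copy_encodings(decode_nodes, prefill_nodes, ptq_target):
    # Pair up matching nodes and copy encodings where the consumer op
    # is one of the PTQ-quantized targets.
    for decode_node, prefill_node in zip(decode_nodes, prefill_nodes):
        if list(decode_node.users)[0].target in ptq_target:
            activation_override(decode_node, prefill_node)

# Demo: one matmul consumer whose producer's encoding gets copied over.
user = FakeNode(target="aten.matmul")
d = FakeNode(users=[user], meta={"quant_attrs": {"scale": 0.02, "zero_point": 0}})
p = FakeNode(users=[user])
copy_encodings([d], [p], ptq_target={"aten.matmul"})
print(p.meta["quant_attrs"])  # {'scale': 0.02, 'zero_point': 0}
```

The design intuition is that since the prefill and decode graphs share the same weights and structure, a single calibration pass over decode can supply encodings for both, cutting total quantization time.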
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17805
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 311249c with merge base 0c2ff55.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Thanks a lot! In addition to this, I noticed that SeqMSE grid searching is done sequentially instead of in parallel; is there room to improve there?
Yes, will look into it.
Can you share what's broken? Is it the flow to consume torchtune in the executorch repo?
```diff
  # transpose first to decrease the runtime efforts
  k_cache.append(
-     torch.zeros(
+     torch.ones(
```
Are we initializing the kv cache with different values?
```python
if list(decode_node.users)[0].target in ptq_target:
    activation_override(decode_node, prefill_node)
```
```python
# copy encoding for hybrid mode
```
Are you copying over the quantization parameters from kv mode to prefill mode?
```python
# ...
# however, pytorch will use different computation kernels for different
# workloads (AR1 vs ARN) which will introduce some numerical discrepancy.
```
What is the mechanism to make sure the encodings align correctly?
I'm worried about the accuracy too if we get rid of prefill calibration. Do you think that generating prompt + output using the fp32 model (pre-observers), as discussed in PR #17786, and running prefill + decode as before with skip_generate might yield better accuracy than getting rid of prefill calibration entirely?
Prefill calibration ideally is not needed, because decode sees all the generated tokens too, and the prefill graph and decode graph should be the same. I remember @haowhsu-quic mentioned we insert the kv cache output of prefill and connect it to the decode input to make sure those quant nodes are also calibrated. I did a comparison of quant params between prefill and decode in the past and they were very, very close. I'm trying to figure out whether this PR handles the kv cache differently than before.
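The claim that decode-side calibration can cover prefill can be illustrated with a toy min/max observer. This is a sketch of the intuition only, not the actual quantizer: a min/max-style observer over the same token activations yields the same scale whether tokens arrive one at a time (AR1 / decode) or all at once (ARN / prefill); in practice the different compute kernels introduce the small numerical discrepancy mentioned above, so the params come out very close rather than identical. `minmax_scale` and the sample values are made up for the example.

```python
# Toy illustration (assumption, not the real observer): the same
# activations produce the same per-tensor scale under AR1 and ARN.

def minmax_scale(values, qmax=255):
    # Asymmetric 8-bit per-tensor scale from observed min/max.
    lo, hi = min(values), max(values)
    return (hi - lo) / qmax

token_acts = [[-0.8, 0.1], [0.5, 1.2], [-0.3, 0.9]]  # 3 decode steps

# Decode (AR1): observe step by step, keeping a running min/max.
running = []
for step in token_acts:
    running.extend(step)
decode_scale = minmax_scale(running)

# Prefill (ARN): observe the whole prompt in one shot.
prefill_scale = minmax_scale([v for step in token_acts for v in step])

assert abs(decode_scale - prefill_scale) < 1e-9
print(decode_scale)  # 2.0 / 255 ≈ 0.00784
```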
I guess my question is that prefill "sees" previous tokens, and the attention block will take these into consideration while generating the kv cache.
For weights, what you said makes sense, as they are idempotent from the math standpoint; maybe we should just check the PPL on-device?
CC: @metascroy, @kimishpatel if you have any thoughts on this.
Summary
Total Quantization Time
Test plan
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript / TestExampleMultimodalityScript