
Conversation

@hebangwen
Contributor

This PR addresses #3947. When benchmarking gemma3, the model emits the EOS token very early (after only 3~4 tokens), but we still compute the decode speed as if all 128 requested tokens had been generated. As a result, the model reports an inflated decoding tok/s.

before fix:

| model | modelSize | backend | threads | precision | llm_demo | speed (tok/s) |
|---|---|---|---|---|---|---|
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128<br>decode=128 | 45.88 ± 0.54<br>316.10 ± 2.58 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256<br>decode=128 | 45.63 ± 0.53<br>311.16 ± 1.67 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512<br>decode=128 | 45.00 ± 0.34<br>11.91 ± 0.14 |

after fix:

| model | modelSize | backend | threads | precision | llm_demo | speed (tok/s) |
|---|---|---|---|---|---|---|
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=128<br>decode=128 | 44.90 ± 0.46<br>12.28 ± 0.22 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=256<br>decode=128 | 45.41 ± 0.14<br>12.22 ± 0.04 |
| gemma-3-1b-it-qat-q4_0-gguf-MNN | 994.65 MiB | CPU | 4 | Low | prompt=512<br>decode=128 | 45.00 ± 0.08<br>12.04 ± 0.02 |

