perf: Disable thinking mode for local Qwen3/3.5 to improve inference speed #512
Description
Problem
The local Qwen3/3.5 models (for example `qwen3.5-4b-q4_k_m`) currently generate `<think>` reasoning tokens during inference. While the streaming code in `ReasoningService.ts` already strips these `<think>` blocks from the output, the model still spends time generating them, which significantly impacts inference speed on a small 4B model.
This is especially relevant for dictation/transcription cleanup — the primary use case — where thinking overhead adds latency without any benefit to the user.
Current behavior
- User runs inference with Qwen3.5 4B locally via llama.cpp
- Model generates `<think>...</think>` tokens (reasoning phase)
- `processTextStreaming()` strips `<think>` blocks from the streamed output
- User sees clean output but waits for thinking tokens to be generated first
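The post-hoc stripping boils down to something like the following simplified, non-streaming sketch (`stripThink` is a hypothetical illustration of what `processTextStreaming()` does per response, not the actual code, which works on chunks):

```typescript
// Remove complete <think>...</think> blocks from a finished model response.
// Note: the tokens were still generated; this only hides them from the user.
function stripThink(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/g, "").trimStart();
}
```

This is exactly the problem: the regex runs after generation, so the latency of producing the reasoning tokens is already paid.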
Expected behavior
Thinking mode should be disabled at the inference level for Qwen3.5 4B (and potentially all local Qwen3.5 models) so the model skips reasoning entirely and responds faster.
Suggested approach
Qwen3/3.5 models support disabling thinking via:
- Chat template parameter: pass `enable_thinking: false` via llama.cpp's `--chat-template-kwargs '{"enable_thinking": false}'`
- Prompt-level control: append `/no_think` to the user message or system prompt
Either approach would prevent the model from generating <think> tokens entirely, saving inference time rather than just stripping them after generation.
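A minimal sketch of the prompt-level option, assuming OpenAI-style chat messages and that local Qwen model ids look like `qwen3.5-4b-q4_k_m` (`disableThinkingForQwen` and its id regex are illustrative helpers, not existing code in this repo):

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Append the /no_think control token to the last user message for local
// Qwen3/3.5 models, leaving other models untouched.
function disableThinkingForQwen(messages: ChatMessage[], modelId: string): ChatMessage[] {
  const isLocalQwen3 = /^qwen3(\.5)?-/.test(modelId); // assumption: local id naming scheme
  if (!isLocalQwen3) return messages;
  return messages.map((m, i) =>
    i === messages.length - 1 && m.role === "user"
      ? { ...m, content: `${m.content} /no_think` }
      : m
  );
}
```

The chat-template route (`enable_thinking: false`) is likely the cleaner fix since it needs no prompt mutation, but the prompt-level route works without changing how the llama.cpp server is launched.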
Context
- Groq's Qwen3 32B already has `disableThinking: true` in `modelRegistryData.json`; a similar mechanism could be extended to local models
- Issue feat: Add model-aware reasoning effort overrides #492 addresses reasoning effort overrides for cloud models, but doesn't cover local model thinking mode
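If the registry route is taken, local entries could mirror the existing Groq flag, along these lines (the field layout here is illustrative; the actual `modelRegistryData.json` schema may differ):

```json
{
  "qwen3.5-4b-q4_k_m": {
    "provider": "local",
    "disableThinking": true
  },
  "qwen3.5-2b-q4_k_m": {
    "provider": "local",
    "disableThinking": true
  }
}
```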
- The Qwen3.5 4B model is 2.7GB — every token of unnecessary reasoning is proportionally expensive on consumer hardware
Affected models
- `qwen3.5-4b-q4_k_m` (primary: smallest, most speed-sensitive)
- `qwen3.5-2b-q4_k_m` (same family, same issue)
- `qwen3.5-9b-q4_k_m` (less critical but still applies)
- `qwen3-4b-q4_k_m`, `qwen3-8b-*`, `qwen3-1.7b-*` (Qwen3 family, same thinking behavior)