perf: Disable thinking mode for local Qwen3/3.5 to improve inference speed #512

@jalvarezz13

Description

Problem

The local Qwen3/3.5 models (for example qwen3.5-4b-q4_k_m) currently generate <think> reasoning tokens during inference. While the streaming code in ReasoningService.ts already strips these <think> blocks from the output, the model still spends time generating them, which significantly slows inference on a small 4B model.

This is especially relevant for dictation/transcription cleanup — the primary use case — where thinking overhead adds latency without any benefit to the user.

Current behavior

  1. User runs inference with Qwen3.5 4B locally via llama.cpp
  2. Model generates <think>...</think> tokens (reasoning phase)
  3. processTextStreaming() strips <think> blocks from the streamed output
  4. User sees clean output but waits for thinking tokens to be generated first
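To illustrate why step 3 doesn't help with latency, here is a minimal sketch of post-hoc stripping (the function name is illustrative, not the actual processTextStreaming implementation): by the time the regex runs, the model has already paid the cost of generating every token inside the block.

```typescript
// Sketch of post-hoc <think> stripping, as assumed to happen in
// ReasoningService.ts. The model has already generated these tokens;
// stripping only hides them from the user, it saves no inference time.
function stripThinkBlocks(text: string): string {
  // Remove complete <think>...</think> blocks, including multi-line ones.
  return text.replace(/<think>[\s\S]*?<\/think>/g, "").trimStart();
}

const raw = "<think>Let me reason about this...</think>Cleaned transcription.";
console.log(stripThinkBlocks(raw)); // "Cleaned transcription."
```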

Expected behavior

Thinking mode should be disabled at the inference level for Qwen3.5 4B (and potentially all local Qwen3.5 models) so the model skips reasoning entirely and responds faster.

Suggested approach

Qwen3/3.5 models support disabling thinking via:

  • Chat template parameter: Pass enable_thinking: false via llama.cpp's --chat-template-kwargs '{"enable_thinking": false}'
  • Prompt-level control: Append /no_think to the user message or system prompt

Either approach would prevent the model from generating <think> tokens entirely, saving inference time rather than just stripping them after generation.
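The prompt-level variant could be wired in with something like the following sketch (the helper name, model-id prefix check, and integration point are all illustrative, not existing project API):

```typescript
// Sketch: append Qwen's "/no_think" soft switch for local Qwen3/3.5
// models so <think> generation is suppressed at inference time.
// Prefix list and helper are hypothetical, for illustration only.
const LOCAL_QWEN_PREFIXES = ["qwen3-", "qwen3.5-"];

function applyNoThink(modelId: string, userMessage: string): string {
  const isLocalQwen = LOCAL_QWEN_PREFIXES.some((p) => modelId.startsWith(p));
  return isLocalQwen ? `${userMessage} /no_think` : userMessage;
}

console.log(applyNoThink("qwen3.5-4b-q4_k_m", "Clean up this dictation."));
// "Clean up this dictation. /no_think"
```

The chat-template route (enable_thinking: false) is likely the cleaner fix since it doesn't mutate user prompts, but the soft switch is easier to trial without touching llama.cpp launch arguments.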

Context

  • Groq's Qwen3 32B already has disableThinking: true in modelRegistryData.json — a similar mechanism could be extended to local models
  • Issue feat: Add model-aware reasoning effort overrides #492 addresses reasoning effort overrides for cloud models, but doesn't cover local model thinking mode
  • The Qwen3.5 4B model is 2.7GB — every token of unnecessary reasoning is proportionally expensive on consumer hardware
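Extending the existing registry mechanism might look like the fragment below. Only the disableThinking key is confirmed (it already exists for Groq's Qwen3 32B); the entry shape and model key are assumptions about modelRegistryData.json.

```json
{
  "qwen3.5-4b-q4_k_m": {
    "disableThinking": true
  }
}
```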

Affected models

  • qwen3.5-4b-q4_k_m (primary — smallest, most speed-sensitive)
  • qwen3.5-2b-q4_k_m (same family, same issue)
  • qwen3.5-9b-q4_k_m (less critical but still applies)
  • qwen3-4b-q4_k_m, qwen3-8b-*, qwen3-1.7b-* (Qwen3 family, same thinking behavior)
