fix: prevent gateway race condition when switching providers #1190

Closed
Jah-yee wants to merge 6 commits into NousResearch:main from Jah-yee:fix/provider-race-condition

@Jah-yee (Contributor) commented Mar 13, 2026

When running hermes setup or hermes model while the gateway is running, _update_config_for_provider() writes to config.yaml immediately with the new provider/base_url but preserves the old model name. This creates a race condition where the gateway can send requests with an incompatible model name to the new provider.

The Problem

  • User has OpenRouter with anthropic/claude-opus-4.6 configured
  • User runs hermes setup and selects MiniMax as provider
  • _update_config_for_provider() writes: provider=minimax, base_url=... but model still = anthropic/claude-opus-4.6
  • Gateway picks up the config change and sends anthropic/claude-opus-4.6 to MiniMax API → fails

The Fix

  1. Adds an optional default_model parameter to _update_config_for_provider() in auth.py
  2. Passes a sensible default model when switching to an affected provider (minimax, minimax-cn, zai, kimi-coding)
  3. Leaves the later model selection step free to override this default
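
A minimal sketch of the change described above, assuming the config is a plain dict with top-level provider, base_url, and model keys. The real _update_config_for_provider() in auth.py writes config.yaml and may use a different layout; the function name, signature, and placeholder URLs below are illustrative assumptions, not the code in this PR.

```python
from typing import Optional


def update_config_for_provider(
    config: dict,
    provider: str,
    base_url: str,
    default_model: Optional[str] = None,
) -> dict:
    """Return a copy of config switched to the new provider.

    Hypothetical stand-in for _update_config_for_provider(); the real
    function persists to config.yaml rather than returning a dict.
    """
    updated = dict(config)
    updated["provider"] = provider
    updated["base_url"] = base_url
    # Without a default, the old model name survives the switch, which is
    # the inconsistent state the gateway can observe.
    if default_model is not None:
        updated["model"] = default_model
    return updated


old = {
    "provider": "openrouter",
    "base_url": "https://openrouter.example/v1",  # placeholder URL
    "model": "anthropic/claude-opus-4.6",
}

# Before the fix: the provider changes, the model does not.
stale = update_config_for_provider(old, "minimax", "https://minimax.example/v1")

# After the fix: provider and model change in the same write.
fixed = update_config_for_provider(
    old, "minimax", "https://minimax.example/v1", default_model="MiniMax-M2.5"
)
```

Because default_model is optional, existing callers that omit it keep the old behavior, and the later model selection step can still overwrite the model key.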

Affected Providers

  • MiniMax (default: MiniMax-M2.5)
  • MiniMax-CN (default: MiniMax-M2.5)
  • Z.AI (default: glm-4.7)
  • Kimi (default: kimi-k2.5)

These providers use different model name formats than OpenRouter, so a model name carried over from OpenRouter is incompatible with them.
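
The defaults listed above could be kept in a single lookup table, sketched below. The dict name and helper are hypothetical, and pairing each display name with a provider id (for example Kimi with kimi-coding) is inferred from the order of the two lists in this description; the PR does not show how the defaults are actually stored.

```python
from typing import Optional

# Defaults as listed in this PR, keyed by the provider ids it names.
# The id-to-default pairing is an assumption based on list order.
PROVIDER_DEFAULT_MODELS = {
    "minimax": "MiniMax-M2.5",
    "minimax-cn": "MiniMax-M2.5",
    "zai": "glm-4.7",
    "kimi-coding": "kimi-k2.5",
}


def default_model_for(provider: str) -> Optional[str]:
    """Return the default model for an affected provider, else None."""
    return PROVIDER_DEFAULT_MODELS.get(provider)
```

Returning None for providers outside the table lets the caller leave the existing model untouched in that case.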

Jah-yee and others added 6 commits March 12, 2026 16:38
Allows users to override the hardcoded 900s timeout when using
local LLM providers like Ollama or LM Studio.

Fixes NousResearch#1010
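
The commit message does not say how the override is supplied. One possible shape, assuming an environment variable, is sketched here; the variable name HERMES_API_TIMEOUT and the helper are hypothetical and are not taken from the codebase.

```python
import os
from typing import Mapping

DEFAULT_TIMEOUT_SECONDS = 900.0  # the hardcoded value mentioned in the commit


def resolve_timeout(
    env: Mapping[str, str] = os.environ,
    var: str = "HERMES_API_TIMEOUT",  # hypothetical variable name
) -> float:
    """Return the user-supplied timeout in seconds, or the 900s default."""
    raw = env.get(var)
    if raw is None or not raw.strip():
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Unparseable values fall back to the default instead of crashing.
        return DEFAULT_TIMEOUT_SECONDS
    # Zero, negative, and NaN values are rejected the same way.
    return value if value > 0 else DEFAULT_TIMEOUT_SECONDS
```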

When llama.cpp returns function call responses, message.content can be a
dict instead of a string, causing 'dict' object has no attribute 'strip'
error. This fix adds type checking before calling .strip().
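
A sketch of the kind of type check this commit describes. The helper name is hypothetical, and converting non-string content with str() is an assumption; the commit only states that the type is checked before .strip() is called.

```python
from typing import Any


def content_to_text(content: Any) -> str:
    """Normalize a message content value to stripped text."""
    if isinstance(content, str):
        return content.strip()
    if content is None:
        return ""
    # Non-string payloads (e.g. a dict from a function-call response)
    # are stringified first, so .strip() is only ever called on a str.
    return str(content).strip()
```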

- Bridge stt.enabled from config.yaml to HERMES_STT_ENABLED env var
- Check env var in _enrich_message_with_transcription before processing
- When stt.enabled: false, voice messages pass through without transcription

Fixes: NousResearch#1100
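
The three steps above can be sketched as follows. Only the stt.enabled key and the HERMES_STT_ENABLED variable come from the commit message; the helper names, the "true"/"false" encoding, the enabled-by-default behavior, and the transcript format are assumptions made for illustration.

```python
from typing import Callable, Mapping, MutableMapping, Optional

STT_ENV_VAR = "HERMES_STT_ENABLED"


def bridge_stt_setting(config: Mapping, env: MutableMapping[str, str]) -> None:
    """Copy stt.enabled from the parsed config into the environment."""
    stt = config.get("stt") or {}
    enabled = stt.get("enabled")
    if enabled is None:
        enabled = True  # assumed default when the key is absent or null
    env[STT_ENV_VAR] = "true" if enabled else "false"


def stt_enabled(env: Mapping[str, str]) -> bool:
    """Treat anything other than an explicit "false" as enabled."""
    return env.get(STT_ENV_VAR, "true").strip().lower() != "false"


def enrich_message(
    text: str,
    voice_note: Optional[bytes],
    env: Mapping[str, str],
    transcribe: Callable[[bytes], str],
) -> str:
    """Append a transcript unless STT is disabled or there is no audio."""
    if voice_note is None or not stt_enabled(env):
        return text  # voice message passes through untranscribed
    return f"{text}\n[voice transcript] {transcribe(voice_note)}"
```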

When users explicitly set at_hour or idle_minutes to null in their
config.yaml, the from_dict() method now correctly applies default values
instead of passing None to validation logic.

Fixes: NousResearch#1119
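
A sketch of the null handling this commit describes, assuming from_dict() belongs to a dataclass-style SessionResetPolicy. The default values and the validation rule are placeholders, not the project's real ones. The point is that data.get("at_hour", default) still returns None when the key is present with a null value, so the check has to be explicit.

```python
from dataclasses import dataclass

DEFAULT_AT_HOUR = 4          # placeholder default, not the project's value
DEFAULT_IDLE_MINUTES = 1440  # placeholder default, not the project's value


@dataclass
class SessionResetPolicy:
    at_hour: int = DEFAULT_AT_HOUR
    idle_minutes: int = DEFAULT_IDLE_MINUTES

    def __post_init__(self) -> None:
        # Comparing None with an int here would raise TypeError, which is
        # why nulls must be replaced before construction.
        if not 0 <= self.at_hour <= 23:
            raise ValueError("at_hour must be between 0 and 23")
        if self.idle_minutes < 0:
            raise ValueError("idle_minutes must not be negative")

    @classmethod
    def from_dict(cls, data: dict) -> "SessionResetPolicy":
        at_hour = data.get("at_hour")
        idle_minutes = data.get("idle_minutes")
        # "is None" rather than "or", so a legitimate 0 is preserved.
        return cls(
            at_hour=DEFAULT_AT_HOUR if at_hour is None else at_hour,
            idle_minutes=(
                DEFAULT_IDLE_MINUTES if idle_minutes is None else idle_minutes
            ),
        )
```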

When running 'hermes setup' or 'hermes model' while the gateway is
running, _update_config_for_provider() writes to config.yaml immediately
with the new provider/base_url but preserves the old model name. This
creates a race condition where the gateway can send requests with an
incompatible model name to the new provider.

This fix:
1. Adds optional 'default_model' parameter to _update_config_for_provider()
2. When switching to affected providers (minimax, minimax-cn, zai,
   kimi-coding), pass a sensible default model to prevent the race
3. The model selection step later can still override this default

Affected providers: MiniMax, MiniMax-CN, Z.AI, Kimi
These providers use different model name formats than OpenRouter.

In setup.py, _update_config_for_provider was called without default_model
for OpenAI Codex, causing a race condition where:
1. Provider is updated to openai-codex in config.yaml
2. Gateway picks up new provider
3. But model is still the old one (e.g., anthropic/claude-opus-4.6 from OpenRouter)
4. Gateway sends wrong model to Codex → fails

This fix:
- Line ~598: Pass 'gpt-5.3-codex' as default when first setting up Codex
- Line ~936: Pass the selected model (or fallback to default) to ensure
  the config always has a valid model for the current provider

This prevents the race condition where the gateway uses a model name
from a different provider after provider switch.
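
The "selected model, or fall back to the default" rule from the second bullet can be sketched as below. The helper name is hypothetical; only the gpt-5.3-codex default comes from the commit message, and treating a blank selection as "no selection" is an assumption.

```python
from typing import Optional

CODEX_DEFAULT_MODEL = "gpt-5.3-codex"  # default named in the commit message


def model_for_codex(selected: Optional[str]) -> str:
    """Return the user's selection, or the Codex default if none was made."""
    if selected and selected.strip():
        return selected.strip()
    return CODEX_DEFAULT_MODEL
```

The result is what would be passed as default_model, so the written config never pairs openai-codex with a model name from another provider.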

@teknium1 (Contributor) commented

Closing — the core race condition fix (default_model parameter on _update_config_for_provider()) is already on main. The function now writes a valid default model when switching providers to prevent the gateway from using an incompatible model name.

The other changes bundled in this PR (context compressor non-string handling, SessionResetPolicy null values, STT enable/disable, configurable timeout) have also been addressed independently in the 1084 commits since this PR was opened.

Thank you for identifying the race condition @Jah-yee!

@teknium1 closed this Mar 17, 2026
2 participants