feat(inference): multi-route proxy with alias-based model routing #618
cosmicnet wants to merge 4 commits into NVIDIA:main from
Conversation
All contributors have signed the DCO ✍️ ✅
I have read the DCO document and I hereby sign the DCO.
Pull request overview
Adds multi-route inference proxying so sandboxes can route inference.local requests to multiple LLM backends by using a model alias in the request body.
Changes:
- Extends the inference proto + gateway storage to support multiple `(alias, provider_name, model_id)` entries per route.
- Adds alias-first route selection in the router and passes a `model_hint` extracted from sandbox request bodies.
- Expands sandbox L7 inference patterns and adds an Ollama provider profile + endpoint validation probe.
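The "model hint extracted from sandbox request bodies" step can be sketched as follows. This is a simplified, stdlib-only stand-in (the real proxy would use a JSON parser such as serde_json); the function name and the assumption of a flat JSON object with a string-valued top-level `model` key are illustrative, not the PR's actual implementation.

```rust
// Hypothetical sketch: pull the top-level "model" value out of a JSON request
// body to use as a routing hint. A naive scan is used here to stay
// dependency-free; real code would parse the body properly.
fn extract_model_hint(body: &str) -> Option<String> {
    let key_pos = body.find("\"model\"")?;
    let after_key = &body[key_pos + "\"model\"".len()..];
    // Skip whitespace, the colon, and the opening quote of the value.
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let value = after_colon.strip_prefix('"')?;
    let end = value.find('"')?;
    Some(value[..end].to_string())
}

fn main() {
    let body = r#"{"model": "claude", "messages": []}"#;
    assert_eq!(extract_model_hint(body), Some("claude".to_string()));
    // Bodies without a model field yield no hint, so the router falls back.
    assert_eq!(extract_model_hint("{}"), None);
    println!("ok");
}
```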
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| proto/inference.proto | Adds InferenceModelEntry and models fields for multi-model inference config. |
| crates/openshell-server/src/inference.rs | Implements multi-model upsert + resolves each alias into separate ResolvedRoute entries. |
| crates/openshell-sandbox/src/proxy.rs | Extracts model from JSON body and forwards it as model_hint to the router. |
| crates/openshell-sandbox/src/l7/inference.rs | Adds Codex + Ollama native API patterns and tests. |
| crates/openshell-router/src/lib.rs | Adds select_route() and extends proxy APIs to accept model_hint. |
| crates/openshell-router/src/backend.rs | Adds Ollama validation probe and changes backend URL construction behavior. |
| crates/openshell-router/tests/backend_integration.rs | Updates tests for new proxy function signatures and /v1 endpoint expectations. |
| crates/openshell-core/src/inference.rs | Adds OLLAMA_PROFILE (protocols/base URL/config keys). |
| crates/openshell-cli/src/run.rs | Adds gateway_inference_set_multi() to send multi-model configs. |
| crates/openshell-cli/src/main.rs | Adds --model-alias ALIAS=PROVIDER/MODEL CLI flag and dispatch. |
| architecture/inference-routing.md | Documents alias-based route selection, new patterns, and multi-model route behavior. |
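The backend URL construction change mentioned for `backend.rs` (always stripping a `/v1` prefix so both versioned and non-versioned backends can be joined) might look roughly like this. The function names and joining logic are assumptions for illustration, not the crate's actual API.

```rust
// Hypothetical sketch of "/v1"-prefix stripping before joining a request path
// onto a backend base URL, so a base that already ends in /v1 (or a backend
// with no version segment, like Ollama) is not double-prefixed.
fn strip_v1_prefix(path: &str) -> &str {
    path.strip_prefix("/v1")
        // Only treat it as a version prefix if a path boundary follows
        // (so "/v1beta/..." is left untouched).
        .filter(|rest| rest.is_empty() || rest.starts_with('/'))
        .unwrap_or(path)
}

fn build_backend_url(base: &str, path: &str) -> String {
    format!("{}{}", base.trim_end_matches('/'), strip_v1_prefix(path))
}

fn main() {
    assert_eq!(
        build_backend_url("https://api.example.com/v1", "/v1/chat/completions"),
        "https://api.example.com/v1/chat/completions"
    );
    assert_eq!(
        build_backend_url("http://host.openshell.internal:11434", "/api/chat"),
        "http://host.openshell.internal:11434/api/chat"
    );
    println!("ok");
}
```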
Force-pushed af1748b to ab71175
Add pattern detection, provider profile, and validation probe for Ollama's native /api/chat, /api/tags, and /api/show endpoints.

Proxy changes (l7/inference.rs):
- POST /api/chat -> ollama_chat protocol
- GET /api/tags -> ollama_model_discovery protocol
- POST /api/show -> ollama_model_discovery protocol

Provider profile (openshell-core/inference.rs):
- New 'ollama' provider type with default endpoint http://host.openshell.internal:11434
- Supports ollama_chat, ollama_model_discovery, and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery)
- Credential lookup via OLLAMA_API_KEY, base URL via OLLAMA_BASE_URL

Validation (backend.rs):
- Ollama validation probe sends minimal /api/chat request with stream:false

Tests: 4 new tests for pattern detection (ollama chat, tags, show, and GET /api/chat rejection).

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
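The method/path-to-protocol mapping this commit describes can be sketched as a small match. The function name is an illustrative stand-in; the real matcher in `l7/inference.rs` is presumably more general, but the mapping below mirrors the behavior listed above, including rejecting GET on /api/chat.

```rust
// Sketch of the Ollama pattern detection described in the commit message:
// map (HTTP method, path) pairs onto inference protocol names.
fn detect_ollama_protocol(method: &str, path: &str) -> Option<&'static str> {
    match (method, path) {
        ("POST", "/api/chat") => Some("ollama_chat"),
        ("GET", "/api/tags") => Some("ollama_model_discovery"),
        ("POST", "/api/show") => Some("ollama_model_discovery"),
        // Anything else (including GET /api/chat) is not an inference request.
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_ollama_protocol("POST", "/api/chat"), Some("ollama_chat"));
    assert_eq!(detect_ollama_protocol("GET", "/api/tags"), Some("ollama_model_discovery"));
    assert_eq!(detect_ollama_protocol("POST", "/api/show"), Some("ollama_model_discovery"));
    assert_eq!(detect_ollama_protocol("GET", "/api/chat"), None);
    println!("ok");
}
```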
- Proto: add InferenceModelEntry message with alias/provider/model fields; add repeated models field to ClusterInferenceConfig, Set/Get request/response
- Server: add upsert_multi_model_route() for storing multiple model entries under a single route slot; update resolve_route_by_name() to expand multi-model configs into per-alias ResolvedRoute entries
- Router: add select_route() with alias-first, protocol-fallback strategy; add model_hint parameter to proxy_with_candidates() variants
- Sandbox proxy: extract model field from JSON body as routing hint
- Tests: 7 new tests covering select_route, multi-model resolution, and bundle expansion; all 291 existing tests continue to pass

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
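The alias-first, protocol-fallback strategy named in this commit can be sketched as below. The `Route` struct and `select_route` signature are simplified stand-ins for the router's actual types (`ResolvedRoute` carries more fields): a route whose alias equals the model hint wins; otherwise the first route supporting the requested protocol is used.

```rust
// Illustrative sketch of alias-first, protocol-fallback route selection.
struct Route {
    alias: String,
    protocols: Vec<String>,
}

fn select_route<'a>(
    routes: &'a [Route],
    model_hint: Option<&str>,
    protocol: &str,
) -> Option<&'a Route> {
    // 1. Alias-first: an exact match on the model hint wins outright.
    if let Some(hint) = model_hint {
        if let Some(route) = routes.iter().find(|r| r.alias == hint) {
            return Some(route);
        }
    }
    // 2. Protocol fallback: first route that speaks the requested protocol.
    routes.iter().find(|r| r.protocols.iter().any(|p| p == protocol))
}

fn main() {
    let routes = vec![
        Route { alias: "gpt".into(), protocols: vec!["openai_chat_completions".into()] },
        Route { alias: "local".into(), protocols: vec!["ollama_chat".into()] },
    ];
    // Alias match takes priority over protocol support.
    assert_eq!(select_route(&routes, Some("local"), "openai_chat_completions").unwrap().alias, "local");
    // Unknown alias falls back to protocol matching.
    assert_eq!(select_route(&routes, Some("unknown"), "ollama_chat").unwrap().alias, "local");
    println!("ok");
}
```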
- Add --model-alias flag to 'inference set' for multi-model config (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
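Parsing a repeatable `--model-alias ALIAS=PROVIDER/MODEL` value into the (alias, provider_name, model_id) triple the proto entry carries might look like this. The function name and error strings are illustrative, not the CLI's actual code.

```rust
// Hypothetical sketch: split "ALIAS=PROVIDER/MODEL" into its three parts.
// split_once on '/' keeps any later slashes inside the model id.
fn parse_model_alias(arg: &str) -> Result<(String, String, String), String> {
    let (alias, rest) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got '{arg}'"))?;
    let (provider, model) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after '=', got '{rest}'"))?;
    if alias.is_empty() || provider.is_empty() || model.is_empty() {
        return Err(format!("empty component in '{arg}'"));
    }
    Ok((alias.to_string(), provider.to_string(), model.to_string()))
}

fn main() {
    assert_eq!(
        parse_model_alias("claude=anthropic/claude-sonnet-4-20250514").unwrap(),
        ("claude".into(), "anthropic".into(), "claude-sonnet-4-20250514".into())
    );
    assert!(parse_model_alias("no-equals").is_err());
    assert!(parse_model_alias("alias=no-slash").is_err());
    println!("ok");
}
```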
…te guard

- Add timeout_secs parameter to gateway_inference_set_multi and pass through to SetClusterInferenceRequest
- Add print_timeout to multi-model output display
- Add timeout field to router test helper make_route (upstream added timeout to ResolvedRoute)
- Add system route guard: upsert_multi_model_route rejects route_name == sandbox-system with InvalidArgument
- Add timeout_secs: 0 to multi-model test ClusterInferenceConfig structs
- Add upsert_multi_model_route_rejects_system_route test

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
Force-pushed ab71175 to d887f04
@pimlock Happy to address any feedback or questions. Let me know if you'd like anything restructured or split differently.
I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside of the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.
```rust
const OLLAMA_PROTOCOLS: &[&str] = &[
    "ollama_chat",
    "ollama_model_discovery",
    "openai_chat_completions",
    "openai_completions",
    "model_discovery",
];
```
Is there a reason for using the Ollama inference protocol rather than the OpenAI one? Is there something extra that Ollama supports that cannot be accessed through the OpenAI one?
Ollama exposes native endpoints (/api/chat, /api/tags, /api/show) that provide capabilities not available through its OpenAI-compatible layer:
- `/api/tags` lists all locally available models (no OpenAI equivalent)
- `/api/show` returns model metadata: parameters, template, license, quantization info
- `/api/chat` supports Ollama-specific options like num_ctx, num_predict, temperature variants, and raw mode
The OLLAMA_PROTOCOLS list includes both native and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery), so agents can use either interface. The native protocols are there so tools that use the Ollama client library directly (which targets /api/*) work through inference.local without needing to switch to the OpenAI-compat paths.
If you'd prefer to keep it simpler and only support Ollama through its OpenAI-compat layer, I can drop the native patterns and the ollama_chat/ollama_model_discovery protocols. The tradeoff is that model discovery (/api/tags) and agent tooling that uses the Ollama SDK directly wouldn't work.
Thanks for the feedback. This PR follows the approach outlined in #203 (option B: single record with repeated entries, alias-first selection with protocol fallback, model hint from the request body). I appreciate that was closed off citing the replacement issue #207, but that covers a different concern. System vs user inference is about who the route serves, not how many backends it can reach. This PR already accommodates that split through the system route guard and separate sandbox.inference.local endpoint.

On the external proxy: it's a valid pattern, but the overhead feels disproportionate here. This is a static alias lookup table. There's no load balancing, retries, rate limiting, or discovery. The maintenance surface is one function, one proto field, and one server method. For users with 2-3 providers, standing up a separate proxy service is a lot of ceremony for a lookup table.

More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

If the team has decided this doesn't belong in the embedded proxy, I can scope this down to just the Ollama native API support and Codex pattern matching (commits 1-2) and drop the multi-model routing. Happy to go either way.
Summary
Adds multi-route inference proxy support, allowing sandboxed agents to reach multiple LLM providers (OpenAI, Anthropic, NVIDIA, Ollama) through a single `inference.local` endpoint. Agents select a backend by setting the `model` field to an alias name. Also adds Ollama native API support and Codex URL pattern matching.

Related Issue
Closes #203
Changes
- `InferenceModelEntry` message (alias, provider_name, model_id); add `models` repeated field to set/get request/response messages
- `upsert_multi_model_route()` validates and stores multiple alias→provider mappings; resolves each entry into a separate `ResolvedRoute` at bundle time
- `select_route()` implements alias-first, protocol-fallback selection; `proxy_with_candidates`/`proxy_with_candidates_streaming` accept optional `model_hint`
- Sandbox proxy extracts the `model` field from the request body as `model_hint` for route selection
- Adds `/v1/codex/*`, `/api/chat`, `/api/tags`, `/api/show` inference patterns
- `build_backend_url()` always strips the `/v1` prefix to support both versioned and non-versioned endpoints (e.g. Codex)
- Adds `OLLAMA_PROFILE` provider profile with native + OpenAI-compat protocols
- Adds `--model-alias ALIAS=PROVIDER/MODEL` flag (repeatable, conflicts with `--provider`/`--model`)
- Updates `inference-routing.md` with all new sections

Testing

- `mise run pre-commit` passes

Checklist