feat(inference): multi-route proxy with alias-based model routing #618
cosmicnet wants to merge 4 commits into NVIDIA:main from
Conversation
All contributors have signed the DCO ✍️ ✅
I have read the DCO document and I hereby sign the DCO.
Pull request overview
Adds multi-route inference proxying so sandboxes can route inference.local requests to multiple LLM backends by using a model alias in the request body.
Changes:
- Extends the inference proto + gateway storage to support multiple `(alias, provider_name, model_id)` entries per route.
- Adds alias-first route selection in the router and passes a `model_hint` extracted from sandbox request bodies.
- Expands sandbox L7 inference patterns and adds an Ollama provider profile + endpoint validation probe.
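The "model hint extracted from sandbox request bodies" step can be sketched as follows. This is a simplified, stdlib-only stand-in (the real proxy would use a JSON parser such as serde_json); the function name and the assumption of a flat JSON object with a string-valued top-level `model` key are illustrative, not the PR's actual implementation.

```rust
// Hypothetical sketch: pull the top-level "model" value out of a JSON request
// body to use as a routing hint. A naive scan is used here to stay
// dependency-free; real code would parse the body properly.
fn extract_model_hint(body: &str) -> Option<String> {
    let key_pos = body.find("\"model\"")?;
    let after_key = &body[key_pos + "\"model\"".len()..];
    // Skip whitespace, the colon, and the opening quote of the value.
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let value = after_colon.strip_prefix('"')?;
    let end = value.find('"')?;
    Some(value[..end].to_string())
}

fn main() {
    let body = r#"{"model": "claude", "messages": []}"#;
    assert_eq!(extract_model_hint(body), Some("claude".to_string()));
    // Bodies without a model field yield no hint, so the router falls back.
    assert_eq!(extract_model_hint("{}"), None);
    println!("ok");
}
```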
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| proto/inference.proto | Adds InferenceModelEntry and models fields for multi-model inference config. |
| crates/openshell-server/src/inference.rs | Implements multi-model upsert + resolves each alias into separate ResolvedRoute entries. |
| crates/openshell-sandbox/src/proxy.rs | Extracts model from JSON body and forwards it as model_hint to the router. |
| crates/openshell-sandbox/src/l7/inference.rs | Adds Codex + Ollama native API patterns and tests. |
| crates/openshell-router/src/lib.rs | Adds select_route() and extends proxy APIs to accept model_hint. |
| crates/openshell-router/src/backend.rs | Adds Ollama validation probe and changes backend URL construction behavior. |
| crates/openshell-router/tests/backend_integration.rs | Updates tests for new proxy function signatures and /v1 endpoint expectations. |
| crates/openshell-core/src/inference.rs | Adds OLLAMA_PROFILE (protocols/base URL/config keys). |
| crates/openshell-cli/src/run.rs | Adds gateway_inference_set_multi() to send multi-model configs. |
| crates/openshell-cli/src/main.rs | Adds --model-alias ALIAS=PROVIDER/MODEL CLI flag and dispatch. |
| architecture/inference-routing.md | Documents alias-based route selection, new patterns, and multi-model route behavior. |
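The backend URL construction change mentioned for `backend.rs` (always stripping a `/v1` prefix so both versioned and non-versioned backends can be joined) might look roughly like this. The function names and joining logic are assumptions for illustration, not the crate's actual API.

```rust
// Hypothetical sketch of "/v1"-prefix stripping before joining a request path
// onto a backend base URL, so a base that already ends in /v1 (or a backend
// with no version segment, like Ollama) is not double-prefixed.
fn strip_v1_prefix(path: &str) -> &str {
    path.strip_prefix("/v1")
        // Only treat it as a version prefix if a path boundary follows
        // (so "/v1beta/..." is left untouched).
        .filter(|rest| rest.is_empty() || rest.starts_with('/'))
        .unwrap_or(path)
}

fn build_backend_url(base: &str, path: &str) -> String {
    format!("{}{}", base.trim_end_matches('/'), strip_v1_prefix(path))
}

fn main() {
    assert_eq!(
        build_backend_url("https://api.example.com/v1", "/v1/chat/completions"),
        "https://api.example.com/v1/chat/completions"
    );
    assert_eq!(
        build_backend_url("http://host.openshell.internal:11434", "/api/chat"),
        "http://host.openshell.internal:11434/api/chat"
    );
    println!("ok");
}
```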
Force-pushed af1748b to ab71175
Add pattern detection, provider profile, and validation probe for Ollama's native /api/chat, /api/tags, and /api/show endpoints.

Proxy changes (l7/inference.rs):
- POST /api/chat -> ollama_chat protocol
- GET /api/tags -> ollama_model_discovery protocol
- POST /api/show -> ollama_model_discovery protocol

Provider profile (openshell-core/inference.rs):
- New 'ollama' provider type with default endpoint http://host.openshell.internal:11434
- Supports ollama_chat, ollama_model_discovery, and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery)
- Credential lookup via OLLAMA_API_KEY, base URL via OLLAMA_BASE_URL

Validation (backend.rs):
- Ollama validation probe sends minimal /api/chat request with stream:false

Tests: 4 new tests for pattern detection (ollama chat, tags, show, and GET /api/chat rejection).

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
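The method/path-to-protocol mapping this commit describes can be sketched as a small match. The function name is an illustrative stand-in; the real matcher in `l7/inference.rs` is presumably more general, but the mapping below mirrors the behavior listed above, including rejecting GET on /api/chat.

```rust
// Sketch of the Ollama pattern detection described in the commit message:
// map (HTTP method, path) pairs onto inference protocol names.
fn detect_ollama_protocol(method: &str, path: &str) -> Option<&'static str> {
    match (method, path) {
        ("POST", "/api/chat") => Some("ollama_chat"),
        ("GET", "/api/tags") => Some("ollama_model_discovery"),
        ("POST", "/api/show") => Some("ollama_model_discovery"),
        // Anything else (including GET /api/chat) is not an inference request.
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_ollama_protocol("POST", "/api/chat"), Some("ollama_chat"));
    assert_eq!(detect_ollama_protocol("GET", "/api/tags"), Some("ollama_model_discovery"));
    assert_eq!(detect_ollama_protocol("POST", "/api/show"), Some("ollama_model_discovery"));
    assert_eq!(detect_ollama_protocol("GET", "/api/chat"), None);
    println!("ok");
}
```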
- Proto: add InferenceModelEntry message with alias/provider/model fields; add repeated models field to ClusterInferenceConfig, Set/Get request/response
- Server: add upsert_multi_model_route() for storing multiple model entries under a single route slot; update resolve_route_by_name() to expand multi-model configs into per-alias ResolvedRoute entries
- Router: add select_route() with alias-first, protocol-fallback strategy; add model_hint parameter to proxy_with_candidates() variants
- Sandbox proxy: extract model field from JSON body as routing hint
- Tests: 7 new tests covering select_route, multi-model resolution, and bundle expansion; all 291 existing tests continue to pass

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
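The alias-first, protocol-fallback strategy named in this commit can be sketched as below. The `Route` struct and `select_route` signature are simplified stand-ins for the router's actual types (`ResolvedRoute` carries more fields): a route whose alias equals the model hint wins; otherwise the first route supporting the requested protocol is used.

```rust
// Illustrative sketch of alias-first, protocol-fallback route selection.
struct Route {
    alias: String,
    protocols: Vec<String>,
}

fn select_route<'a>(
    routes: &'a [Route],
    model_hint: Option<&str>,
    protocol: &str,
) -> Option<&'a Route> {
    // 1. Alias-first: an exact match on the model hint wins outright.
    if let Some(hint) = model_hint {
        if let Some(route) = routes.iter().find(|r| r.alias == hint) {
            return Some(route);
        }
    }
    // 2. Protocol fallback: first route that speaks the requested protocol.
    routes.iter().find(|r| r.protocols.iter().any(|p| p == protocol))
}

fn main() {
    let routes = vec![
        Route { alias: "gpt".into(), protocols: vec!["openai_chat_completions".into()] },
        Route { alias: "local".into(), protocols: vec!["ollama_chat".into()] },
    ];
    // Alias match takes priority over protocol support.
    assert_eq!(select_route(&routes, Some("local"), "openai_chat_completions").unwrap().alias, "local");
    // Unknown alias falls back to protocol matching.
    assert_eq!(select_route(&routes, Some("unknown"), "ollama_chat").unwrap().alias, "local");
    println!("ok");
}
```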
- Add --model-alias flag to 'inference set' for multi-model config (e.g. --model-alias gpt=openai/gpt-4 --model-alias claude=anthropic/claude-sonnet-4-20250514)
- Add gateway_inference_set_multi() handler in run.rs
- Update inference get/print to display multi-model entries
- Import InferenceModelEntry proto type in CLI
- Fix build_backend_url to always strip /v1 prefix for codex paths
- Add /v1/codex/* inference pattern for openai_responses protocol
- Fix backend tests to use /v1 endpoint suffix

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
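Parsing a repeatable `--model-alias ALIAS=PROVIDER/MODEL` value into the (alias, provider_name, model_id) triple the proto entry carries might look like this. The function name and error strings are illustrative, not the CLI's actual code.

```rust
// Hypothetical sketch: split "ALIAS=PROVIDER/MODEL" into its three parts.
// split_once on '/' keeps any later slashes inside the model id.
fn parse_model_alias(arg: &str) -> Result<(String, String, String), String> {
    let (alias, rest) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected ALIAS=PROVIDER/MODEL, got '{arg}'"))?;
    let (provider, model) = rest
        .split_once('/')
        .ok_or_else(|| format!("expected PROVIDER/MODEL after '=', got '{rest}'"))?;
    if alias.is_empty() || provider.is_empty() || model.is_empty() {
        return Err(format!("empty component in '{arg}'"));
    }
    Ok((alias.to_string(), provider.to_string(), model.to_string()))
}

fn main() {
    assert_eq!(
        parse_model_alias("claude=anthropic/claude-sonnet-4-20250514").unwrap(),
        ("claude".into(), "anthropic".into(), "claude-sonnet-4-20250514".into())
    );
    assert!(parse_model_alias("no-equals").is_err());
    assert!(parse_model_alias("alias=no-slash").is_err());
    println!("ok");
}
```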
…te guard

- Add timeout_secs parameter to gateway_inference_set_multi and pass through to SetClusterInferenceRequest
- Add print_timeout to multi-model output display
- Add timeout field to router test helper make_route (upstream added timeout to ResolvedRoute)
- Add system route guard: upsert_multi_model_route rejects route_name == sandbox-system with InvalidArgument
- Add timeout_secs: 0 to multi-model test ClusterInferenceConfig structs
- Add upsert_multi_model_route_rejects_system_route test

Signed-off-by: Lyle Hopkins <lyle@cosmicnetworks.com>
Force-pushed ab71175 to d887f04
@pimlock Happy to address any feedback or questions. Let me know if you'd like anything restructured or split differently.
I am curious: if you need this level of routing support, have you considered setting up a dedicated proxy/router that is accessible outside of the sandbox and just configuring access to it with network policies? This is a typical pattern several of our users follow.
```rust
const OLLAMA_PROTOCOLS: &[&str] = &[
    "ollama_chat",
    "ollama_model_discovery",
    "openai_chat_completions",
    "openai_completions",
    "model_discovery",
];
```
Is there a reason for using the Ollama inference protocol rather than the OpenAI one? Is there something extra that Ollama supports that cannot be accessed through the OpenAI one?
Ollama exposes native endpoints (/api/chat, /api/tags, /api/show) that provide capabilities not available through its OpenAI-compatible layer:
- `/api/tags` lists all locally available models (no OpenAI equivalent)
- `/api/show` returns model metadata: parameters, template, license, quantization info
- `/api/chat` supports Ollama-specific options like num_ctx, num_predict, temperature variants, and raw mode
The OLLAMA_PROTOCOLS list includes both native and OpenAI-compatible protocols (openai_chat_completions, openai_completions, model_discovery), so agents can use either interface. The native protocols are there so tools that use the Ollama client library directly (which targets /api/*) work through inference.local without needing to switch to the OpenAI-compat paths.
If you'd prefer to keep it simpler and only support Ollama through its OpenAI-compat layer, I can drop the native patterns and the ollama_chat/ollama_model_discovery protocols. The tradeoff is that model discovery (/api/tags) and agent tooling that uses the Ollama SDK directly wouldn't work.
Thanks for the feedback. This PR follows the approach outlined in #203 (option B: single record with repeated entries, alias-first selection with protocol fallback, model hint from the request body). I appreciate that was closed off citing the replacement issue #207, but that covers a different concern. System vs user inference is about who the route serves, not how many backends it can reach. This PR already accommodates that split through the system route guard and separate sandbox.inference.local endpoint.

On the external proxy: it's a valid pattern, but the overhead feels disproportionate here. This is a static alias lookup table. There's no load balancing, retries, rate limiting, or discovery. The maintenance surface is one function, one proto field, and one server method. For users with 2-3 providers, standing up a separate proxy service is a lot of ceremony for a lookup table.

More broadly, my understanding is that NemoClaw/OpenShell is positioned as an enterprise-ready platform for running AI agents securely out of the box. In that context, multi-model access feels like a baseline expectation rather than an edge case. Agents routinely need a fast cheap model for simple tasks and a more capable one for complex reasoning, or a specialised model for specific domains. If each of those requires its own external proxy and network policy, that's a significant barrier to the "out of the box" experience. Maybe I'm misunderstanding the intended scope, but it's hard to see how single-model inference serves that use case long term.

If the team has decided this doesn't belong in the embedded proxy, I can scope this down to just the Ollama native API support and Codex pattern matching (commits 1-2) and drop the multi-model routing. Happy to go either way.
Summary
Adds multi-route inference proxy support, allowing sandboxed agents to reach multiple LLM providers (OpenAI, Anthropic, NVIDIA, Ollama) through a single `inference.local` endpoint. Agents select a backend by setting the `model` field to an alias name. Also adds Ollama native API support and Codex URL pattern matching.

Related Issue
Closes #203
Changes
- `InferenceModelEntry` message (alias, provider_name, model_id); add `models` repeated field to set/get request/response messages
- `upsert_multi_model_route()` validates and stores multiple alias→provider mappings; resolves each entry into a separate `ResolvedRoute` at bundle time
- `select_route()` implements alias-first, protocol-fallback selection; `proxy_with_candidates`/`proxy_with_candidates_streaming` accept optional `model_hint`
- Sandbox proxy extracts the `model` field from the request body as `model_hint` for route selection
- Adds `/v1/codex/*`, `/api/chat`, `/api/tags`, `/api/show` inference patterns
- `build_backend_url()` always strips the `/v1` prefix to support both versioned and non-versioned endpoints (e.g. Codex)
- Adds `OLLAMA_PROFILE` provider profile with native + OpenAI-compat protocols
- Adds `--model-alias ALIAS=PROVIDER/MODEL` flag (repeatable, conflicts with `--provider`/`--model`)
- Updates `inference-routing.md` with all new sections

Testing

- `mise run pre-commit` passes

Checklist