Adaptive tool routing for AI agents of any size.
Every AI agent framework presents all tools identically regardless of model size:
| Model | Tools shown | Accuracy | Tokens wasted |
|---|---|---|---|
| 1.5B | All 80 | 50% | 3,400 |
| 35B | All 80 | 88% | 3,400 |
A 1.5B model on a Raspberry Pi receives the same tool descriptions as a 35B model on a GPU server. The small model drowns in options. The model isn't bad at using tools — it's bad at finding them.
Tool selection decomposes into two stages:
P(correct tool) = P(correct family) × P(correct tool | correct family)
Measuring the two stages separately across four model sizes, the results were surprising:
| Model | P(right family) | P(right tool \| family) |
|---|---|---|
| 1.5B | 56% | 89% |
| 9B | 82% | 98% |
| 20B | 84% | 95% |
| 35B | 90% | 98% |
Even a 1.5B model picks the right tool 89% of the time — when it's looking in the right neighborhood. The bottleneck isn't selection, it's navigation.
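As a sanity check, the two stage probabilities multiply out to the end-to-end accuracies (quick arithmetic using the numbers from the table above, not SDK code):

```python
# P(correct tool) = P(correct family) * P(correct tool | correct family)
stages = {
    "1.5B": (0.56, 0.89),
    "9B":   (0.82, 0.98),
    "20B":  (0.84, 0.95),
    "35B":  (0.90, 0.98),
}
for model, (p_family, p_tool_given_family) in stages.items():
    print(f"{model}: {p_family * p_tool_given_family:.0%}")
# prints 50%, 80%, 80%, 88%: exactly the baseline accuracies in the benchmark
```

The product reproducing the baseline column is what justifies attacking family navigation rather than tool selection.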
Adapt the interface, not the model. Different model sizes get different tool presentations:
| | Tiny (< 4B) | Large (14-35B) | XL (35B+) |
|---|---|---|---|
| Strategy | Hybrid | Reorder + hint | Full |
| What model sees | 8 detailed + 72 name-only | All 80, relevance-sorted | All 80, full descriptions |
| `file_read` shown as | "Read file" (`path: str`) | "Read file with encoding control" (`path`, `encoding`, `lines`) | "Read file with line numbers, offset, encoding" (`path`, `encoding`, `lines`, `offset`, `limit`) |
| Accuracy | 60% (+10pp) | 88% (+8pp) | 88% (baseline) |
| Tokens | 97% fewer | same | same |
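A minimal sketch of the hybrid presentation, assuming relevance scores already computed by an embedding model (`hybrid_presentation` is an illustrative helper, not the SDK API):

```python
def hybrid_presentation(tools, scores, k=8):
    """Top-k relevant tools keep full definitions; the rest become name-only stubs.

    tools  -- list of {"name", "description", "parameters"} dicts
    scores -- {tool_name: relevance to the current query}
    """
    ranked = sorted(tools, key=lambda t: scores.get(t["name"], 0.0), reverse=True)
    stubs = [{"name": t["name"], "description": "", "parameters": {}}
             for t in ranked[k:]]
    # The model still sees every tool name, but only k full specifications
    return ranked[:k] + stubs
```

This is why the hybrid row reports both an accuracy gain and a 97% token reduction: the long tail of tools costs only a name each.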
Benchmarked across 1,000+ native tool-calling inference calls (Ollama `/api/chat`), 4 models, 80 tools, 50 prompts:
| Strategy | 1.5B | 9B | 20B | 35B | Tokens |
|---|---|---|---|---|---|
| Baseline (80 tools, full desc) | 50% | 80% | 80% | 88% | 2,100-5,300 |
| Hybrid (8 detailed + 72 names) | 60% | — | 76% | — | ~1,800 |
| Reorder + hint | 54% | — | 88% | — | same |
| Family oracle (upper bound) | 70% | 86% | 84% | 88% | 400-900 |
Key findings:
- Hybrid works best for tiny models: +10pp accuracy, 97% fewer tokens
- Reorder + hint works best for large models: +8pp, makes 20B match 35B
- No single strategy dominates — optimal presentation is scale-dependent
- Token savings of 83-92% with filtering strategies
- Native tool calling matters — text injection produces different (misleading) results
Full results and analysis in the whitepaper.
Install the SDK:

```bash
pip install yantrikos-sdk
```

Define a tool with per-tier descriptions and parameters:

```python
from yantrikos import BaseTool, ToolResult, Tier, register

@register
class FileReadTool(BaseTool):
    name = "file_read"
    category = "filesystem"
    descriptions = {
        Tier.S: "Read file",
        Tier.M: "Read a file from disk",
        Tier.L: "Read file with encoding control",
        Tier.XL: "Read file with line numbers, offset, and encoding",
    }
    parameters = {
        Tier.S: {"path": str},
        Tier.M: {"path": str, "encoding": str},
        Tier.L: {"path": str, "encoding": str, "line_numbers": bool},
        Tier.XL: {"path": str, "encoding": str, "line_numbers": bool,
                  "offset": int, "limit": int},
    }

    def execute(self, input: dict, tier: Tier) -> ToolResult:
        path = input["path"]
        # encoding is only declared for Tier.M and above; default it otherwise
        with open(path, encoding=input.get("encoding", "utf-8")) as f:
            content = f.read()
        if tier == Tier.S:
            # Tiny models get a truncated payload to protect their context budget
            return ToolResult.ok(content[:1000])
        return ToolResult.ok(content)
```

Route a request; the router auto-detects the tier from the model name:

```python
from yantrikos import TierRouter

router = TierRouter(model_name="qwen2.5:1.5b")  # -> Tier.S, hybrid strategy
native_tools = router.route("Read the file config.yaml")
# Returns Ollama/OpenAI native tool definitions, adapted for 1.5B

router_large = TierRouter(model_name="gpt-4o")  # -> Tier.XL, full strategy
native_tools = router_large.route("Read the file config.yaml")
# Returns full tool definitions with all parameters
```

Or detect a tier directly:

```python
from yantrikos import detect_tier

detect_tier("qwen3.5:0.6b")     # -> Tier.S
detect_tier("qwen3.5:9b")       # -> Tier.M
detect_tier("gpt-oss:20b")      # -> Tier.L
detect_tier("claude-opus-4-6")  # -> Tier.XL
```

The SDK parses model names to determine capability:
| Tier | Parameters | Strategy | Tools shown | Format |
|---|---|---|---|---|
| S (Tiny) | < 4B | Hybrid | 8 detailed + rest name-only | Short descriptions, minimal params |
| M (Medium) | 4-14B | Hybrid | 8 detailed + rest name-only | Condensed descriptions |
| L (Large) | 14-35B | Reorder | All tools, relevance-sorted | Full descriptions, category hint |
| XL (X-Large) | 35B+ | Full | All tools | Full descriptions, all params |
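Illustratively, tier detection can be as simple as parsing the parameter-count suffix from the model name. This sketch is not the SDK's implementation; `guess_tier` and its string return values are hypothetical (the real `detect_tier` returns `Tier` enum members):

```python
import re

def guess_tier(model_name: str) -> str:
    """Map a size suffix like '1.5b' or '20b' to a tier band (hypothetical helper)."""
    m = re.search(r"(\d+(?:\.\d+)?)b", model_name.lower())
    if m is None:
        # No size suffix (e.g. gpt-4o, claude-opus-4-6): assume a frontier model
        return "XL"
    size_b = float(m.group(1))  # parameter count in billions
    if size_b < 4:
        return "S"
    if size_b < 14:
        return "M"
    if size_b < 35:
        return "L"
    return "XL"
```

Suffix parsing alone misreads names that carry no size, which is why a lookup table for known hosted models is the safer complement.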
Every tool declares behavior per tier — descriptions get shorter, parameters get fewer:

```python
descriptions = {
    Tier.S: "Search web",                                # 10 chars — tiny-model focus
    Tier.M: "Search the web",                            # 14 chars
    Tier.L: "Search web for info",                       # 19 chars
    Tier.XL: "Search web with filters and date range",   # 38 chars
}

parameters = {
    Tier.S: {"query": str},                              # 1 param
    Tier.M: {"query": str, "limit": int},                # 2 params
    Tier.L: {"query": str, "limit": int},                # 2 params
    Tier.XL: {"query": str, "limit": int, "date": str},  # 3 params
}
```

Tools export as OpenAI/Ollama native format — ready for `/api/chat`:
```python
from yantrikos import to_native_tool, Tier

native = to_native_tool(my_tool, Tier.S)
# {
#   "type": "function",
#   "function": {
#     "name": "web_search",
#     "description": "Search web",
#     "parameters": {
#       "type": "object",
#       "properties": {"query": {"type": "string"}},
#       "required": ["query"]
#     }
#   }
# }
```

The TierRouter selects the best strategy per tier:
- **Hybrid (Tiny/Medium)**: Top-K semantically relevant tools get full descriptions; the rest appear as name-only entries. The model focuses on the best candidates but can still pick from the full set.
- **Reorder (Large)**: All tools are presented, sorted by semantic relevance to the query, so the most likely tools appear first. Combined with a system-prompt category hint for +8pp accuracy.
- **Full (XL)**: All tools with full descriptions. Large models don't need adaptation.
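The reorder strategy can be sketched as follows (illustrative, not the SDK's implementation; the hint wording and the precomputed relevance scores are assumptions):

```python
def reorder_with_hint(tools, scores, category_of):
    """Present every tool, sorted by query relevance, plus a category hint.

    tools       -- native tool definitions (all shown, full descriptions)
    scores      -- {tool_name: semantic relevance to the query}
    category_of -- {tool_name: category}
    """
    ranked = sorted(tools, key=lambda t: scores.get(t["function"]["name"], 0.0),
                    reverse=True)
    top = ranked[0]["function"]["name"]
    # The hint goes into the system prompt, steering family selection
    hint = f"Hint: the request most likely involves {category_of[top]} tools."
    return ranked, hint
```

Unlike hybrid, nothing is truncated here, which is why the token count stays the same while accuracy improves.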
Every tool built with the SDK must declare:
- `name` — unique tool identifier
- `category` — semantic family (filesystem, web, code, data, etc.)
- `descriptions` — one per tier, shortest for S, longest for XL
- `parameters` — one set per tier, fewest for S, most for XL
- `execute(input, tier)` — tier-aware execution
- Descriptions should be discriminative, not exhaustive. For Tier.S, use the 2-3 words that distinguish this tool from all others.
- First parameter is always the most important one. It's the only one a tiny model sees.
- Categories should be semantically distinct. Don't put CSV tools in both "filesystem" and "data."
- Test at Tier.S. If a 1.5B model can't pick your tool from its short description, rewrite it.
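One cheap check for the "discriminative" rule: no two tools should share a Tier.S description. A hypothetical helper, not part of the SDK:

```python
from collections import Counter

def colliding_tier_s_descriptions(tools):
    """Return Tier-S descriptions shared by more than one tool (case-insensitive).

    tools -- list of (tool_name, tier_s_description) pairs
    """
    counts = Counter(desc.strip().lower() for _, desc in tools)
    return {desc for desc, n in counts.items() if n > 1}
```

Running a check like this in CI catches the most common failure mode at Tier.S: two tools whose two-word descriptions are indistinguishable.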
The SDK validates tools at registration:

```python
@register  # Raises ToolValidationError if:
class MyTool(BaseTool):
    # - name is empty
    # - any tier is missing a description
    # - Tier.S has more params than Tier.XL
    # - descriptions or parameters dict is empty
    ...
```

The tier architecture originates from YantrikOS, an AI-native desktop OS (under active development) with 116+ tools across 48 categories, designed for models from 0.8B to 35B+. The ModelCapabilityProfile adapts six dimensions: tool count, call format, slot extraction, family routing, context budget, and confidence thresholds.
YantrikOS addresses the family-detection bottleneck through `discover_tools` — a meta-tool that lets models navigate the tool space iteratively with self-correction.
The Tier plugin is available on ClawHub as a code plugin. It integrates the SDK with OpenClaw's gateway, automatically adapting tool presentation based on the configured model.
```bash
git clone https://github.com/yantrikos/tier
cd tier
pip install yantrikos-sdk yantrikdb sentence-transformers
python benchmarks/harness_v3.py
```

Raw results (1,000+ data points): `benchmarks/results_v3_full.jsonl`
```bibtex
@misc{sarkar2026tier,
  author    = {Sarkar, Pranab},
  title     = {Tier-Based Adaptive Tool Routing for Capability-Heterogeneous AI Agents},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19228710},
  url       = {https://zenodo.org/records/19228710}
}
```

MIT