This repository contains the benchmark suite behind Smithery's blog post, *MCP vs CLI Is the Wrong Fight*.
The suite compares the same tasks and backends across four surfaces:
- raw API priors
- raw APIs with machine-readable specs
- thin native MCP tools
- generic CLI surfaces
It is designed to measure agent experience, not to prove that one interface always wins.
Key findings:
- Specs help raw API calling: 51.7% to 73.3% success in `api_priors_vs_specs`.
- Thin native MCP beats direct API use with specs on the same tasks: 55.0% to 91.7% success in `specs_vs_native_mcp`.
- On the same thin surface, MCP beats the described CLI: 91.7% vs 83.3% success in `native_mcp_vs_cli`.
- On the full 826-tool GitHub catalog, explicit CLI search closes part of the gap: 66.7% to 87.5%, while native MCP remains at 100.0%.
Experiment families:
- `api_priors_vs_specs`
- `specs_vs_native_mcp`
- `native_mcp_vs_cli`
- `cli_topology`
- `linear_graphql_precision_ablation`
- `github_large_catalog_native_mcp_vs_cli`
- `github_large_catalog_cli_topology`
- `github_large_catalog_search_affordance`
Services:
- GitHub REST
  - 24-operation curated slice
  - 826-tool frozen catalog for large-catalog discovery tests
- Linear GraphQL
  - frozen slice for interface tests
  - live read-only rerun for the GraphQL appendix
  - the public repo ships a redacted task template for the live workspace-specific prompts
- Singapore Bus REST
  - 8-operation niche API slice
Models:
- Claude Code Haiku 4.5
- Claude Code Sonnet 4.6 for the large-catalog GitHub experiments
- Codex GPT-5.4
The checked-in public artifacts are sanitized:
- `bench/results/results.jsonl` contains only the declared 732-run matrix.
- `bench/config/tasks/linear.yaml` keeps the task structure but redacts workspace-specific strings.
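To sanity-check the sanitized run log, you can aggregate success rates per surface directly from the JSONL file. A minimal sketch, assuming each line is a JSON object with hypothetical `surface` and `success` fields (adjust to the actual schema in `bench/results/results.jsonl`):

```python
import json
from collections import defaultdict

def success_rates(path):
    """Compute per-surface success rates from a JSONL run log.

    Assumes each non-empty line is a JSON object with a "surface"
    string and a boolean "success" field; these field names are
    an assumption, not the confirmed schema.
    """
    totals = defaultdict(lambda: [0, 0])  # surface -> [successes, runs]
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            run = json.loads(line)
            bucket = totals[run["surface"]]
            bucket[0] += bool(run["success"])
            bucket[1] += 1
    return {surface: ok / n for surface, (ok, n) in totals.items()}
```

This is just a quick cross-check that the declared run matrix matches the headline numbers; the repo's own `make report` target is the canonical way to regenerate the report.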
Setup:

```shell
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
```

Common commands:
```shell
make preflight
make matrix
make smoke
make pilot
make full
make report
```

Useful docs:
- Method: `docs/method.md`
- Benchmark guide: `docs/benchmark-guide.md`
- Current report: `docs/report.md`