smithery-ai/mcp-vs-cli-bench


MCP vs CLI Is the Wrong Fight

This repository contains the benchmark suite behind Smithery's blog post, MCP vs CLI Is the Wrong Fight.

The suite compares the same tasks and backends across four surfaces:

  • raw API priors
  • raw APIs with machine-readable specs
  • thin native MCP tools
  • generic CLI surfaces

It is designed to measure agent experience, not to prove that one interface always wins.

Current Headline Findings

  • Specs help raw API calling: success rises from 51.7% to 73.3% in api_priors_vs_specs.
  • Thin native MCP beats direct API use with specs on the same tasks: success rises from 55.0% (API with specs) to 91.7% (native MCP) in specs_vs_native_mcp.
  • On the same thin surface, MCP beats the described CLI: 91.7% vs 83.3% success in native_mcp_vs_cli.
  • On the full 826-tool GitHub catalog, explicit CLI search closes part of the gap: CLI success rises from 66.7% to 87.5%, while native MCP remains at 100.0%.
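
To make the size of each effect easier to compare, the headline numbers can be restated as percentage-point gains. The sketch below only re-derives figures already listed above. The dictionary layout, the helper name gain_in_points, and the key github_large_catalog_cli_search are illustrative assumptions and are not names taken from the benchmark code.

```python
# Headline success rates (percent), copied from the findings above.
# Each tuple is (lower rate, higher rate) for that comparison.
findings = {
    "api_priors_vs_specs": (51.7, 73.3),   # priors only -> with specs
    "specs_vs_native_mcp": (55.0, 91.7),   # API with specs -> native MCP
    "native_mcp_vs_cli": (83.3, 91.7),     # described CLI -> native MCP
    # Illustrative label (assumption), not an experiment family name:
    "github_large_catalog_cli_search": (66.7, 87.5),  # CLI -> CLI with search
}


def gain_in_points(before: float, after: float) -> float:
    """Return the difference in percentage points, rounded to one decimal."""
    return round(after - before, 1)


gains = {name: gain_in_points(b, a) for name, (b, a) in findings.items()}

# Gap that remains between CLI-with-search and native MCP on the large catalog.
remaining_gap = gain_in_points(87.5, 100.0)

for name, points in gains.items():
    print(f"{name}: +{points} points")
print(f"remaining gap to native MCP on the large catalog: {remaining_gap} points")
```

These are differences in percentage points, not relative improvements, and they say nothing about run-to-run variance.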

Scope

Experiment families:

  • api_priors_vs_specs
  • specs_vs_native_mcp
  • native_mcp_vs_cli
  • cli_topology
  • linear_graphql_precision_ablation
  • github_large_catalog_native_mcp_vs_cli
  • github_large_catalog_cli_topology
  • github_large_catalog_search_affordance

Services:

  • GitHub REST
    • 24-operation curated slice
    • 826-tool frozen catalog for large-catalog discovery tests
  • Linear GraphQL
    • frozen slice for interface tests
    • live read-only rerun for the GraphQL appendix
    • public repo ships a redacted task template for the live workspace-specific prompts
  • Singapore Bus REST
    • 8-operation niche API slice

Models:

  • Claude Code Haiku 4.5
  • Claude Code Sonnet 4.6 for the large-catalog GitHub experiments
  • Codex GPT-5.4

The checked-in public artifacts are sanitized:

  • bench/results/results.jsonl contains only the declared 732-run matrix.
  • bench/config/tasks/linear.yaml keeps the task structure but redacts workspace-specific strings.
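
A results file in JSON Lines format can be summarized with a few lines of Python. The following is a minimal sketch, not the suite's own reporting code (that is what make report is for). The field names experiment, surface, and success are assumptions about the record schema, and the sample records are made up for illustration. Check the actual keys in bench/results/results.jsonl before relying on it.

```python
import json
from collections import defaultdict


def success_rates(lines, group_keys=("experiment", "surface"), success_key="success"):
    """Compute the success rate per group from an iterable of JSONL lines.

    group_keys and success_key are hypothetical field names; adjust them
    to match the real schema of the results file.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        group = tuple(record.get(key) for key in group_keys)
        totals[group] += 1
        passes[group] += 1 if record.get(success_key) else 0
    return {group: passes[group] / totals[group] for group in totals}


# Made-up sample records, only to show the expected shape of the input.
sample = [
    '{"experiment": "native_mcp_vs_cli", "surface": "mcp", "success": true}',
    '{"experiment": "native_mcp_vs_cli", "surface": "mcp", "success": true}',
    '{"experiment": "native_mcp_vs_cli", "surface": "cli", "success": true}',
    '{"experiment": "native_mcp_vs_cli", "surface": "cli", "success": false}',
]

rates = success_rates(sample)
for (experiment, surface), rate in sorted(rates.items()):
    print(f"{experiment} / {surface}: {rate:.1%}")

# To run against the checked-in file instead (assuming the schema matches):
# with open("bench/results/results.jsonl", encoding="utf-8") as handle:
#     rates = success_rates(handle)
```

Grouping by a tuple of keys keeps the helper usable for other breakdowns, for example by model or by service, if the records carry those fields.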

Quick Start

python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"

Common commands:

make preflight
make matrix
make smoke
make pilot
make full
make report

Useful docs:
