Stop guessing. Start proving your agents work.
AgentsProof is an evaluation harness for AI agent plugins — GitHub Copilot, Claude, or any agent framework. It answers one question: does your skill or agent actually make responses better, and by how much?
For each skill or agent, it runs N trials with it injected and N trials without (baseline), grades both arms, and assigns a letter grade based on the measured delta. No browser. No UI automation. No manual review. Just a clean Python CLI and a self-contained HTML report.
- Works with GitHub Copilot, Claude, and any AI agent
- Evaluates both skills (system-prompt plugins) and agents (custom agent modes)
- Grades with deterministic checks (string matching, regex) and LLM rubrics
- Supports hand-authored test cases via `tc-eval`: bring your own prompts, expected outputs, and assertions — no YAML generation step needed
AgentsProof answers: "Does this skill or agent actually improve responses, and by how much?"
For each `eval.yaml`, it:
- Runs N trials with the skill/agent system prompt injected
- Runs N trials without (baseline: generic helpful assistant)
- Grades both arms with deterministic checks and/or an LLM rubric
- Computes the score delta (with − without) and assigns a letter grade
- Saves a self-contained HTML report (stats, per-trial breakdown, improvement suggestions)
| Grade | Score delta | Meaning |
|---|---|---|
| A | ≥ +0.30 | Skill has strong, reliable impact |
| B | ≥ +0.15 | Good improvement |
| C | ≥ 0.00 | Marginal but positive |
| F | < 0.00 | Skill is not helping (or hurting) — always FAIL |
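Concretely, the grading rule above can be sketched as a small helper (`letter_grade` is an illustrative name, not AgentsProof's actual API):

```python
def letter_grade(delta: float) -> str:
    """Map a score delta (with-arm mean minus baseline mean) to a letter grade."""
    if delta >= 0.30:
        return "A"   # strong, reliable impact
    if delta >= 0.15:
        return "B"   # good improvement
    if delta >= 0.0:
        return "C"   # marginal but positive
    return "F"       # not helping (or hurting): always FAIL
```

Note the boundary behavior: a delta of exactly 0.0 still earns a C, while any negative delta always fails.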
```shell
pip install -e .
```
```shell
# Generate eval.yaml files for every skill and agent in a plugin
agentsproof eval-generate --plugin-dir path/to/my-plugin --trials 3 --model gpt-4o

# Run all evals for a plugin (outputs one HTML report)
agentsproof eval-all evals/my-plugin/ --smoke --ci --model gpt-4o

# Run a single eval
agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml --smoke --model gpt-4o

# Run hand-authored test cases from a JSON file
agentsproof tc-eval path/to/evals.json --trials 2 --model gpt-4o
```

`eval-generate` walks every `skills/*/SKILL.md` and every `agents/*.md` under a plugin directory. It uses an LLM to generate one `eval.yaml` per skill/agent, then copies the source file alongside it so the `skill:`/`agent:` reference resolves at runtime.
```shell
agentsproof eval-generate \
  --plugin-dir path/to/my-plugin \
  [--out evals/my-plugin] \
  [--model gpt-4o] \
  [--dry-run]
```
| Option | Default | Description |
|---|---|---|
| `--plugin-dir` | required | Plugin directory containing `skills/` and/or `agents/` |
| `--out` | `evals/<plugin-name>/` | Output directory |
| `--trials` | 1 | Trials per arm to embed in generated `eval.yaml` files |
| `--model` | `gpt-4o` | LLM model used for generation |
| `--dry-run` | false | Print what would be written without writing |
Output layout:

```text
evals/<plugin-name>/
  skills/<skill-name>/
    eval.yaml
    SKILL.md          ← copied from plugin
  agents/<agent-name>/
    eval.yaml
    <agent-name>.md   ← copied from plugin
```
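The discovery step can be sketched with `pathlib` globs (a simplified illustration of the walk described above, not the generator's actual code):

```python
from pathlib import Path

def discover(plugin_dir: str) -> dict:
    """Find every skill and agent definition under a plugin directory."""
    root = Path(plugin_dir)
    return {
        "skills": sorted(root.glob("skills/*/SKILL.md")),  # one SKILL.md per skill folder
        "agents": sorted(root.glob("agents/*.md")),        # one .md file per agent
    }
```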
```shell
agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml \
  --smoke \
  --model gpt-4o \
  [--ci] [--trials N] [--threshold 0.8]
```
| Option | Default | Description |
|---|---|---|
| `--smoke` | — | 5 trials (quick sanity check) |
| `--reliable` | — | 15 trials |
| `--regression` | — | 30 trials |
| `--trials N` | from `eval.yaml` | Override trial count directly |
| `--threshold` | from `eval.yaml` | Pass rate required (0.0–1.0) |
| `--model` | `AGENTSPROOF_MODEL` env | LLM model for agent calls |
| `--parallel` | 1 | Concurrent trials |
| `--ci` | false | Exit 1 if below threshold |
| `--output` | `runs/` | Output directory |
Output: `runs/<plugin-name>/<skill>-eval.md`, `.json`, and `.html`
Discovers every eval.yaml recursively (skills and agents) and runs them all. Produces a single HTML report for the whole plugin with:
- Separate executive summaries for Skills and Agents
- Collapsible drill-down per skill/agent with trial details
- Improvement suggestions for passing grades (A/B/C)
```shell
agentsproof eval-all evals/my-plugin/ \
  --smoke \
  --ci \
  --model gpt-4o
```
Options are the same as `eval`. With `--ci`, the exit code is non-zero if any eval falls below its threshold.

Output: `runs/<plugin-name>/<plugin-name>-eval.html` and `-eval-all-summary.md`
An alternative to `eval-generate` + `eval-all` when you want full control over the prompts and expected outcomes. Point `tc-eval` at an `evals.json` file and it runs the same dual-arm engine (with skill vs. without), grading each case and producing the same HTML report.
```shell
agentsproof tc-eval path/to/evals.json \
  --trials 2 \
  --model gpt-4o \
  [--smoke | --reliable | --regression] \
  [--ci] [--output runs/]
```
| Option | Default | Description |
|---|---|---|
| `--trials N` | 1 | Trials per arm per case |
| `--smoke` | — | 5 trials per arm |
| `--reliable` | — | 15 trials |
| `--regression` | — | 30 trials |
| `--model` | `AGENTSPROOF_MODEL` env | LLM model |
| `--output` | `runs/` | Output directory |
| `--ci` | false | Exit 1 if any case below threshold |
Output: `runs/<evals-json-stem>/<stem>-tc-eval.html`
```json
{
  "skill_name": "path/to/SKILL.md",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["data/sales_2025.csv"],
      "assertions": [
        "The output includes a bar chart image file",
        "The chart shows exactly 3 months",
        "Both axes are labeled",
        "The chart title or caption mentions revenue"
      ]
    },
    {
      "id": 2,
      "prompt": "Write a test that checks a search box returns results when 'laptop' is typed.",
      "expected_output": "A test that types into a search input and asserts results are visible."
    },
    {
      "id": 3,
      "prompt": "Create a smoke test that just loads the homepage and checks the title.",
      "assertions": [
        "The response includes code that navigates to a URL",
        "The response checks the page title"
      ]
    }
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `skill_name` | yes | Path to the skill `.md` file (relative to the JSON file) |
| `evals[].id` | yes | Unique integer identifier for the case |
| `evals[].prompt` | yes | The user prompt sent to the LLM |
| `evals[].expected_output` | no | Natural language description of expected output — graded by an LLM rubric (weight 0.3 when combined with assertions) |
| `evals[].assertions` | no | Natural language statements about what the response should contain or do — each evaluated independently by an LLM judge (weight 0.7 when combined with `expected_output`) |
| `evals[].files` | no | File paths (relative to the JSON file) whose contents are injected into the prompt as context |
Grader weight rules:

| Graders present | Assertion weight | Rubric weight |
|---|---|---|
| Both `assertions` and `expected_output` | 0.7 | 0.3 |
| Only `assertions` | 1.0 | — |
| Only `expected_output` | — | 1.0 |
| Neither | — | — (score always 1.0) |
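A sketch of how those weights combine into a per-case score (`case_score` is a hypothetical name, not the actual API):

```python
def case_score(assertion_pass_rate=None, rubric_score=None):
    """Combine grader scores per the weight rules: 0.7/0.3 when both are present."""
    if assertion_pass_rate is not None and rubric_score is not None:
        return 0.7 * assertion_pass_rate + 0.3 * rubric_score
    if assertion_pass_rate is not None:
        return assertion_pass_rate   # assertions only: full weight
    if rubric_score is not None:
        return rubric_score          # rubric only: full weight
    return 1.0                       # neither grader: score is always 1.0
```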
Each assertion is a plain-English claim about the response. The LLM judge evaluates every assertion independently and returns a pass/fail verdict with a brief evidence quote. Each assertion gets its own fresh LLM session to ensure reliable evaluation.
Example assertion result:
```json
{
  "assertion_results": [
    {
      "text": "The output includes a bar chart image file",
      "passed": true,
      "evidence": "Found chart.png (45KB) in outputs directory"
    },
    {
      "text": "The chart shows exactly 3 months",
      "passed": true,
      "evidence": "Chart displays bars for March, July, and November"
    },
    {
      "text": "Both axes are labeled",
      "passed": false,
      "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
    },
    {
      "text": "The chart title or caption mentions revenue",
      "passed": true,
      "evidence": "Chart title reads 'Top 3 Months by Revenue'"
    }
  ],
  "summary": { "passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75 }
}
```

Write assertions that describe observable outcomes, not implementation details. Good assertions are specific enough to distinguish a correct response from a generic one — a baseline model without the skill should fail them.
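The `summary` block is a plain aggregation over the per-assertion verdicts; a minimal sketch:

```python
def summarize(assertion_results):
    """Aggregate pass/fail verdicts into the summary shape shown above."""
    total = len(assertion_results)
    passed = sum(1 for r in assertion_results if r["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 1.0,
    }
```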
Skills use `skill:`, agents use `agent:` — everything else is identical.
```yaml
version: "2"
skill: SKILL.md          # for skills — OR —
# agent: my-agent.md     # for agents

defaults:
  trials: 1              # trials per arm (× 2 total: with + without)
  timeout: 300           # seconds per trial
  threshold: 0.6         # pass rate required in the with-skill arm

task:
  name: my-task-name
  instruction: |
    The natural language prompt sent to the agent.
    Describe a realistic scenario — never mention the skill's techniques.

graders:
  # Deterministic: fast string-match checks
  - type: deterministic
    checks:
      mustContain:
        - "expected keyword"
      mustNotContain:
        - "forbidden string"
      regexMatch:
        - "pattern-\\d+"
    weight: 0.7

  # LLM rubric: qualitative scoring
  - type: llm_rubric
    rubric: |
      Correctness (0–0.5): Did the response address the core problem?
      Completeness (0–0.5): Were all required steps covered?
    weight: 0.3
```

Key design rule: the instruction should describe only the problem, never the solution technique. A baseline model should struggle; the skill/agent should make it succeed.

Backward compat: v1 files with `tasks:` (a list) are accepted — the first task is used.
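For intuition, the deterministic grader's checks reduce to string and regex tests over the response; a minimal sketch (not the actual grader implementation):

```python
import re

def deterministic_score(response, must_contain=(), must_not_contain=(), regex_match=()):
    """Fraction of checks satisfied: required substrings, forbidden substrings, regexes."""
    results = []
    results += [s in response for s in must_contain]
    results += [s not in response for s in must_not_contain]
    results += [re.search(p, response) is not None for p in regex_match]
    return sum(results) / len(results) if results else 1.0
```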
Each eval-all run writes one self-contained HTML file. Key sections:
- Header — plugin name, date, item count, avg delta, overall verdict
- Skills summary — stats grid + table (total/passed/failed/avg delta) for skills
- Agents summary — same for agents (only shown if plugin has agents)
- Drill-down — collapsible section per skill/agent:
- Score/time/token stats for both arms
- Per-trial breakdown with grader scores and check results
- Improvement suggestions (grades A/B/C): which checks still fail, how to tighten the eval
- Session log (expandable)
| Type | How it works | Best for |
|---|---|---|
| `deterministic` | String presence/absence + regex, instant | Required keywords, syntax, structure |
| `llm_rubric` | LLM grades the response 0–1 per rubric criterion | Quality, reasoning, completeness |
Both contribute to the trial score weighted by `weight`. Final score = weighted average across graders.

Shorthand: `checks: ["item1", "item2"]` is equivalent to `mustContain: ["item1", "item2"]`.
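A sketch of that normalization (hypothetical helper name):

```python
def normalize_checks(checks):
    """Expand the list shorthand into the full mustContain form."""
    if isinstance(checks, list):
        return {"mustContain": checks}
    return checks  # already a mapping of mustContain / mustNotContain / regexMatch
```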
| Variable | Description |
|---|---|
| `AGENTSPROOF_MODEL` | Default LLM model (overridden by `--model`) |
| `AGENTSPROOF_PROVIDER_TYPE` | BYOK provider: `openai`, `azure`, `ollama` |
| `AGENTSPROOF_PROVIDER_URL` | BYOK base URL |
| `AGENTSPROOF_PROVIDER_KEY` | BYOK API key |
When no BYOK variables are set, AgentsProof uses the GitHub Copilot SDK (requires VS Code with Copilot installed and active).
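Model selection presumably resolves in the usual precedence order: the `--model` flag, then `AGENTSPROOF_MODEL`, then whatever the SDK supplies. A sketch (the helper name and fallback value are illustrative, not the real resolution code):

```python
import os

def resolve_model(cli_model=None, sdk_default="copilot-default"):
    """--model wins; else the AGENTSPROOF_MODEL env var; else the SDK's default."""
    return cli_model or os.environ.get("AGENTSPROOF_MODEL") or sdk_default
```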
```text
agentsproof/
  cli.py             # Click CLI — eval, eval-generate, eval-all, tc-eval
  config.py          # Pydantic models: SkillEvalConfig, TaskConfig, GraderConfig
  report.py          # HTML / Markdown / JSON report rendering
  generator/
    core.py          # LLM-based eval.yaml generation (skills + agents)
  executor/
    runner.py        # Dual-arm runner: N trials with-skill vs N without
  llm/
    client.py        # LLMClient (Copilot SDK + BYOK providers)
  pipeline/
    grader/          # deterministic.py, llm_rubric.py, skillgrade.py
  schemas/
    artifacts.py     # SkillEvalResult, ArmResult, TrialResult, compute_grade()
    tc_eval.py       # TCEvalConfig, TCEvalCase — evals.json schema
evals/               # Evaluation definitions (committed)
  <plugin-name>/
    skills/<skill-name>/
      eval.yaml
      SKILL.md
    agents/<agent-name>/
      eval.yaml
      <agent-name>.md
runs/                # Generated output (gitignored)
  <plugin-name>/
    <plugin-name>-eval.html     ← eval-all report
    eval-all-summary.md
    eval-all-summary.json
  <evals-json-stem>/
    <stem>-tc-eval.html         ← tc-eval report
skills/
  suite-generator/
    SKILL.md         # System prompt for eval-generate
tests/               # pytest suite (45 tests)
```

```shell
python -m pytest tests/ -q
```

| Code | Meaning |
|---|---|
| 0 | All evals passed |
| 1 | One or more evals failed threshold (with --ci) |
| 30 | Error — missing files, SDK unavailable, etc. |
See CONTRIBUTING.md.
Built by Yoav Lax (@YoavLax).


