Stop guessing. Start proving your agents work.
AgentsProof is an evaluation harness for AI agent plugins — GitHub Copilot, Claude, or any agent framework. It answers one question: does your skill or agent actually make responses better, and by how much?
For each skill or agent, it runs N trials with it injected and N trials without (baseline), grades both arms, and assigns a letter grade based on the measured delta. No browser. No UI automation. No manual review. Just a clean Python CLI and a self-contained HTML report.
- Works with GitHub Copilot, Claude, and any AI agent
- Evaluates both skills (system-prompt plugins) and agents (custom agent modes)
- Grades with deterministic checks (string matching, regex) and LLM rubrics
- Supports hand-authored test cases via `tc-eval`: bring your own prompts, expected outputs, and assertions — no YAML generation step needed
AgentsProof answers: "Does this skill or agent actually improve responses, and by how much?"
For each `eval.yaml`, it:
- Runs N trials with the skill/agent system prompt injected
- Runs N trials without (baseline: generic helpful assistant)
- Grades both arms with deterministic checks and/or an LLM rubric
- Computes the score delta (with − without) and assigns a letter grade
- Saves a self-contained HTML report (stats, per-trial breakdown, improvement suggestions)
| Grade | Score delta | Meaning |
|---|---|---|
| A | ≥ +0.30 | Skill has strong, reliable impact |
| B | ≥ +0.15 | Good improvement |
| C | ≥ 0.00 | Marginal but positive |
| F | < 0.00 | Skill is not helping (or hurting) — always FAIL |
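Concretely, the grading rule above can be sketched as a small helper (`letter_grade` is an illustrative name, not AgentsProof's actual API):

```python
def letter_grade(delta: float) -> str:
    """Map a score delta (with-arm mean minus baseline mean) to a letter grade."""
    if delta >= 0.30:
        return "A"   # strong, reliable impact
    if delta >= 0.15:
        return "B"   # good improvement
    if delta >= 0.0:
        return "C"   # marginal but positive
    return "F"       # not helping (or hurting): always FAIL
```

Note the boundary behavior: a delta of exactly 0.0 still earns a C, while any negative delta always fails.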
```shell
pip install -e .
```
```shell
# Generate eval.yaml files for every skill and agent in a plugin
agentsproof eval-generate --plugin-dir path/to/my-plugin --trials 3 --model gpt-4o

# Run all evals for a plugin (outputs one HTML report)
agentsproof eval-all evals/my-plugin/ --smoke --ci --model gpt-4o

# Run a single eval
agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml --smoke --model gpt-4o

# Run hand-authored test cases from a JSON file
agentsproof tc-eval path/to/evals.json --trials 2 --model gpt-4o
```

`eval-generate` walks every `skills/*/SKILL.md` and every `agents/*.md` under a plugin directory. It uses an LLM to generate one `eval.yaml` per skill/agent, then copies the source file alongside it so the `skill:`/`agent:` reference resolves at runtime.
```shell
agentsproof eval-generate \
  --plugin-dir path/to/my-plugin \
  [--out evals/my-plugin] \
  [--model gpt-4o] \
  [--dry-run]
```
| Option | Default | Description |
|---|---|---|
| `--plugin-dir` | required | Plugin directory containing `skills/` and/or `agents/` |
| `--out` | `evals/<plugin-name>/` | Output directory |
| `--trials` | 1 | Trials per arm to embed in generated `eval.yaml` files |
| `--model` | `gpt-4o` | LLM model used for generation |
| `--dry-run` | false | Print what would be written without writing |
Output layout:

```text
evals/<plugin-name>/
  skills/<skill-name>/
    eval.yaml
    SKILL.md          ← copied from plugin
  agents/<agent-name>/
    eval.yaml
    <agent-name>.md   ← copied from plugin
```
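The discovery step can be sketched with `pathlib` globs (a simplified illustration of the walk described above, not the generator's actual code):

```python
from pathlib import Path

def discover(plugin_dir: str) -> dict:
    """Find every skill and agent definition under a plugin directory."""
    root = Path(plugin_dir)
    return {
        "skills": sorted(root.glob("skills/*/SKILL.md")),  # one SKILL.md per skill folder
        "agents": sorted(root.glob("agents/*.md")),        # one .md file per agent
    }
```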
```shell
agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml \
  --smoke \
  --model gpt-4o \
  [--ci] [--trials N] [--threshold 0.8]
```
| Option | Default | Description |
|---|---|---|
| `--smoke` | — | 5 trials (quick sanity check) |
| `--reliable` | — | 15 trials |
| `--regression` | — | 30 trials |
| `--trials N` | from `eval.yaml` | Override trial count directly |
| `--threshold` | from `eval.yaml` | Pass rate required (0.0–1.0) |
| `--model` | `AGENTSPROOF_MODEL` env | LLM model for agent calls |
| `--parallel` | 1 | Concurrent trials |
| `--ci` | false | Exit 1 if below threshold |
| `--output` | `runs/` | Output directory |
Output: `runs/<plugin-name>/<skill>-eval.md`, `.json`, and `.html`
Discovers every eval.yaml recursively (skills and agents) and runs them all. Produces a single HTML report for the whole plugin with:
- Separate executive summaries for Skills and Agents
- Collapsible drill-down per skill/agent with trial details
- Improvement suggestions for passing grades (A/B/C)
```shell
agentsproof eval-all evals/my-plugin/ \
  --smoke \
  --ci \
  --model gpt-4o
```
Options are the same as `eval`. With `--ci`, the exit code is non-zero if any eval falls below its threshold.

Output: `runs/<plugin-name>/<plugin-name>-eval.html` and `-eval-all-summary.md`
An alternative to `eval-generate` + `eval-all` when you want full control over the prompts and expected outcomes. Point `tc-eval` at an `evals.json` file and it runs the same dual-arm engine (with skill vs. without), grading each case and producing the same HTML report.
```shell
agentsproof tc-eval path/to/evals.json \
  --trials 2 \
  --model gpt-4o \
  [--smoke | --reliable | --regression] \
  [--ci] [--output runs/]
```
| Option | Default | Description |
|---|---|---|
| `--trials N` | 1 | Trials per arm per case |
| `--smoke` | — | 5 trials per arm |
| `--reliable` | — | 15 trials |
| `--regression` | — | 30 trials |
| `--model` | `AGENTSPROOF_MODEL` env | LLM model |
| `--output` | `runs/` | Output directory |
| `--ci` | false | Exit 1 if any case below threshold |
Output: `runs/<evals-json-stem>/<stem>-tc-eval.html`
```json
{
  "skill_name": "path/to/SKILL.md",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["data/sales_2025.csv"],
      "assertions": [
        "The output includes a bar chart image file",
        "The chart shows exactly 3 months",
        "Both axes are labeled",
        "The chart title or caption mentions revenue"
      ]
    },
    {
      "id": 2,
      "prompt": "Write a test that checks a search box returns results when 'laptop' is typed.",
      "expected_output": "A test that types into a search input and asserts results are visible."
    },
    {
      "id": 3,
      "prompt": "Create a smoke test that just loads the homepage and checks the title.",
      "assertions": [
        "The response includes code that navigates to a URL",
        "The response checks the page title"
      ]
    }
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `skill_name` | yes | Path to the skill `.md` file (relative to the JSON file) |
| `evals[].id` | yes | Unique integer identifier for the case |
| `evals[].prompt` | yes | The user prompt sent to the LLM |
| `evals[].expected_output` | no | Natural language description of expected output — graded by an LLM rubric (weight 0.3 when combined with assertions) |
| `evals[].assertions` | no | Natural language statements about what the response should contain or do — each evaluated independently by an LLM judge (weight 0.7 when combined with `expected_output`) |
| `evals[].files` | no | File paths (relative to the JSON file) whose contents are injected into the prompt as context |
Grader weight rules:

| Graders present | Assertion weight | Rubric weight |
|---|---|---|
| Both `assertions` and `expected_output` | 0.7 | 0.3 |
| Only `assertions` | 1.0 | — |
| Only `expected_output` | — | 1.0 |
| Neither | — | — (score always 1.0) |
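A sketch of how those weights combine into a per-case score (`case_score` is a hypothetical name, not the actual API):

```python
def case_score(assertion_pass_rate=None, rubric_score=None):
    """Combine grader scores per the weight rules: 0.7/0.3 when both are present."""
    if assertion_pass_rate is not None and rubric_score is not None:
        return 0.7 * assertion_pass_rate + 0.3 * rubric_score
    if assertion_pass_rate is not None:
        return assertion_pass_rate   # assertions only: full weight
    if rubric_score is not None:
        return rubric_score          # rubric only: full weight
    return 1.0                       # neither grader: score is always 1.0
```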
Each assertion is a plain-English claim about the response. The LLM judge evaluates every assertion independently and returns a pass/fail verdict with a brief evidence quote. Each assertion gets its own fresh LLM session to ensure reliable evaluation.
Example assertion result:
```json
{
  "assertion_results": [
    {
      "text": "The output includes a bar chart image file",
      "passed": true,
      "evidence": "Found chart.png (45KB) in outputs directory"
    },
    {
      "text": "The chart shows exactly 3 months",
      "passed": true,
      "evidence": "Chart displays bars for March, July, and November"
    },
    {
      "text": "Both axes are labeled",
      "passed": false,
      "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
    },
    {
      "text": "The chart title or caption mentions revenue",
      "passed": true,
      "evidence": "Chart title reads 'Top 3 Months by Revenue'"
    }
  ],
  "summary": { "passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75 }
}
```

Write assertions that describe observable outcomes, not implementation details. Good assertions are specific enough to distinguish a correct response from a generic one — a baseline model without the skill should fail them.
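The `summary` block is a plain aggregation over the per-assertion verdicts; a minimal sketch:

```python
def summarize(assertion_results):
    """Aggregate pass/fail verdicts into the summary shape shown above."""
    total = len(assertion_results)
    passed = sum(1 for r in assertion_results if r["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 1.0,
    }
```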
Skills use `skill:`, agents use `agent:` — everything else is identical.
```yaml
version: "2"
skill: SKILL.md          # for skills — OR —
# agent: my-agent.md     # for agents

defaults:
  trials: 1              # trials per arm (× 2 total: with + without)
  timeout: 300           # seconds per trial
  threshold: 0.6         # pass rate required in the with-skill arm

task:
  name: my-task-name
  instruction: |
    The natural language prompt sent to the agent.
    Describe a realistic scenario — never mention the skill's techniques.

graders:
  # Deterministic: fast string-match checks
  - type: deterministic
    checks:
      mustContain:
        - "expected keyword"
      mustNotContain:
        - "forbidden string"
      regexMatch:
        - "pattern-\\d+"
    weight: 0.7

  # LLM rubric: qualitative scoring
  - type: llm_rubric
    rubric: |
      Correctness (0–0.5): Did the response address the core problem?
      Completeness (0–0.5): Were all required steps covered?
    weight: 0.3
```

Key design rule: the instruction should describe only the problem, never the solution technique. A baseline model should struggle; the skill/agent should make it succeed.

Backward compat: v1 files with `tasks:` (a list) are accepted — the first task is used.
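For intuition, the deterministic grader's checks reduce to string and regex tests over the response; a minimal sketch (not the actual grader implementation):

```python
import re

def deterministic_score(response, must_contain=(), must_not_contain=(), regex_match=()):
    """Fraction of checks satisfied: required substrings, forbidden substrings, regexes."""
    results = []
    results += [s in response for s in must_contain]
    results += [s not in response for s in must_not_contain]
    results += [re.search(p, response) is not None for p in regex_match]
    return sum(results) / len(results) if results else 1.0
```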
Each eval-all run writes one self-contained HTML file. Key sections:
- Header — plugin name, date, item count, avg delta, overall verdict
- Skills summary — stats grid + table (total/passed/failed/avg delta) for skills
- Agents summary — same for agents (only shown if plugin has agents)
- Drill-down — collapsible section per skill/agent:
- Score/time/token stats for both arms
- Per-trial breakdown with grader scores and check results
- Improvement suggestions (grades A/B/C): which checks still fail, how to tighten the eval
- Session log (expandable)
| Type | How it works | Best for |
|---|---|---|
| `deterministic` | String presence/absence + regex, instant | Required keywords, syntax, structure |
| `llm_rubric` | LLM grades the response 0–1 per rubric criterion | Quality, reasoning, completeness |
Both contribute to the trial score weighted by `weight`. Final score = weighted average across graders.

Shorthand: `checks: ["item1", "item2"]` is equivalent to `mustContain: ["item1", "item2"]`.
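A sketch of that normalization (hypothetical helper name):

```python
def normalize_checks(checks):
    """Expand the list shorthand into the full mustContain form."""
    if isinstance(checks, list):
        return {"mustContain": checks}
    return checks  # already a mapping of mustContain / mustNotContain / regexMatch
```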
| Variable | Description |
|---|---|
| `AGENTSPROOF_MODEL` | Default LLM model (overridden by `--model`) |
| `AGENTSPROOF_PROVIDER_TYPE` | BYOK provider: `openai`, `azure`, `ollama` |
| `AGENTSPROOF_PROVIDER_URL` | BYOK base URL |
| `AGENTSPROOF_PROVIDER_KEY` | BYOK API key |
When no BYOK variables are set, AgentsProof uses the GitHub Copilot SDK (requires VS Code with Copilot installed and active).
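Model selection presumably resolves in the usual precedence order: the `--model` flag, then `AGENTSPROOF_MODEL`, then whatever the SDK supplies. A sketch (the helper name and fallback value are illustrative, not the real resolution code):

```python
import os

def resolve_model(cli_model=None, sdk_default="copilot-default"):
    """--model wins; else the AGENTSPROOF_MODEL env var; else the SDK's default."""
    return cli_model or os.environ.get("AGENTSPROOF_MODEL") or sdk_default
```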
```text
agentsproof/
  cli.py             # Click CLI — eval, eval-generate, eval-all, tc-eval
  config.py          # Pydantic models: SkillEvalConfig, TaskConfig, GraderConfig
  report.py          # HTML / Markdown / JSON report rendering
  generator/
    core.py          # LLM-based eval.yaml generation (skills + agents)
  executor/
    runner.py        # Dual-arm runner: N trials with-skill vs N without
  llm/
    client.py        # LLMClient (Copilot SDK + BYOK providers)
  pipeline/
    grader/          # deterministic.py, llm_rubric.py, skillgrade.py
  schemas/
    artifacts.py     # SkillEvalResult, ArmResult, TrialResult, compute_grade()
    tc_eval.py       # TCEvalConfig, TCEvalCase — evals.json schema
evals/               # Evaluation definitions (committed)
  <plugin-name>/
    skills/<skill-name>/
      eval.yaml
      SKILL.md
    agents/<agent-name>/
      eval.yaml
      <agent-name>.md
runs/                # Generated output (gitignored)
  <plugin-name>/
    <plugin-name>-eval.html     ← eval-all report
    eval-all-summary.md
    eval-all-summary.json
  <evals-json-stem>/
    <stem>-tc-eval.html         ← tc-eval report
skills/
  suite-generator/
    SKILL.md         # System prompt for eval-generate
tests/               # pytest suite (45 tests)
```

```shell
python -m pytest tests/ -q
```

| Code | Meaning |
|---|---|
| 0 | All evals passed |
| 1 | One or more evals failed threshold (with --ci) |
| 30 | Error — missing files, SDK unavailable, etc. |
See CONTRIBUTING.md.
Built by Yoav Lax (@YoavLax).


