
AgentsProof

Stop guessing. Start proving your agents work.

AgentsProof is an evaluation harness for AI agent plugins — GitHub Copilot, Claude, or any agent framework. It answers one question: does your skill or agent actually make responses better, and by how much?

For each skill or agent, it runs N trials with it injected and N trials without (baseline), grades both arms, and assigns a letter grade based on the measured delta. No browser. No UI automation. No manual review. Just a clean Python CLI and a self-contained HTML report.

  • Works with GitHub Copilot, Claude, and any AI agent
  • Evaluates both skills (system-prompt plugins) and agents (custom agent modes)
  • Grades with deterministic checks (string matching, regex) and LLM rubrics
  • Supports hand-authored test cases via tc-eval: bring your own prompts, expected outputs, and assertions — no YAML generation step needed

Executive summary — Skills and Agents sections with grade tables

How it works

AgentsProof answers: "Does this skill or agent actually improve responses, and by how much?"

For each eval.yaml:

  1. Runs N trials with the skill/agent system prompt injected
  2. Runs N trials without (baseline: generic helpful assistant)
  3. Grades both arms with deterministic checks and/or an LLM rubric
  4. Computes the score delta (with − without) and assigns a letter grade
  5. Saves a self-contained HTML report (stats, per-trial breakdown, improvement suggestions)
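The five steps above can be sketched as a small loop. This is illustrative only: the real runner lives in `executor/runner.py`, and `call_llm` / `grade` here are stand-ins for the actual LLM client and grader pipeline.

```python
def run_dual_arm(prompt, skill_system, baseline_system, call_llm, grade, trials=3):
    """Return the score delta: with-skill arm mean minus baseline arm mean.
    Sketch of the dual-arm protocol, not the actual implementation."""
    def arm_mean(system_prompt):
        # Run `trials` independent calls under one system prompt and grade each.
        scores = [grade(call_llm(system_prompt, prompt)) for _ in range(trials)]
        return sum(scores) / len(scores)

    # A positive delta means the skill/agent measurably improved responses.
    return arm_mean(skill_system) - arm_mean(baseline_system)
```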

Grading scale

| Grade | Score delta | Meaning |
|-------|-------------|---------|
| A | ≥ +0.30 | Skill has strong, reliable impact |
| B | ≥ +0.15 | Good improvement |
| C | ≥ 0.00 | Marginal but positive |
| F | < 0.00 | Skill is not helping (or hurting) — always FAIL |
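The scale reduces to a simple threshold check. A sketch follows; the repo exposes a `compute_grade()` in `schemas/artifacts.py`, but the body below is an assumption based on the table, not its exact implementation.

```python
def compute_grade(delta: float) -> str:
    """Map a score delta (with minus without) to a letter grade.
    Thresholds mirror the grading-scale table above (sketch only)."""
    if delta >= 0.30:
        return "A"
    if delta >= 0.15:
        return "B"
    if delta >= 0.00:
        return "C"
    return "F"  # negative delta: the skill is not helping, always FAIL
```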

Quick start

pip install -e .

# Generate eval.yaml files for every skill and agent in a plugin
agentsproof eval-generate --plugin-dir path/to/my-plugin --trials 3 --model gpt-4o

# Run all evals for a plugin (outputs one HTML report)
agentsproof eval-all evals/my-plugin/ --smoke --ci --model gpt-4o

# Run a single eval
agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml --smoke --model gpt-4o

# Run hand-authored test cases from a JSON file
agentsproof tc-eval path/to/evals.json --trials 2 --model gpt-4o

Commands

eval-generate — Generate eval.yaml files from a plugin

Walks every skills/*/SKILL.md and every agents/*.md under a plugin directory. Uses an LLM to generate one eval.yaml per skill/agent, then copies the source file alongside it so the skill:/agent: reference resolves at runtime.

agentsproof eval-generate \
  --plugin-dir path/to/my-plugin \
  [--out evals/my-plugin] \
  [--model gpt-4o] \
  [--dry-run]

| Option | Default | Description |
|--------|---------|-------------|
| --plugin-dir | required | Plugin directory containing skills/ and/or agents/ |
| --out | evals/&lt;plugin-name&gt;/ | Output directory |
| --trials | 1 | Trials per arm to embed in generated eval.yaml files |
| --model | gpt-4o | LLM model used for generation |
| --dry-run | false | Print what would be written without writing |

Output layout:

evals/<plugin-name>/
  skills/<skill-name>/
    eval.yaml
    SKILL.md          ← copied from plugin
  agents/<agent-name>/
    eval.yaml
    <agent-name>.md   ← copied from plugin

eval — Run a single evaluation

agentsproof eval evals/my-plugin/skills/my-skill/eval.yaml \
  --smoke \
  --model gpt-4o \
  [--ci] [--trials N] [--threshold 0.8]

| Option | Default | Description |
|--------|---------|-------------|
| --smoke | — | 5 trials (quick sanity check) |
| --reliable | — | 15 trials |
| --regression | — | 30 trials |
| --trials N | from eval.yaml | Override trial count directly |
| --threshold | from eval.yaml | Pass rate required (0.0–1.0) |
| --model | AGENTSPROOF_MODEL env | LLM model for agent calls |
| --parallel | 1 | Concurrent trials |
| --ci | false | Exit 1 if below threshold |
| --output | runs/ | Output directory |

Output: runs/<plugin-name>/<skill>-eval.md, .json, .html


eval-all — Run all evals under a directory

Discovers every eval.yaml recursively (skills and agents) and runs them all. Produces a single HTML report for the whole plugin with:

  • Separate executive summaries for Skills and Agents
  • Collapsible drill-down per skill/agent with trial details
  • Improvement suggestions for passing grades (A/B/C)
agentsproof eval-all evals/my-plugin/ \
  --smoke \
  --ci \
  --model gpt-4o

Options are the same as eval. Exit code is non-zero if any eval is below threshold (with --ci).

Output: runs/<plugin-name>/<plugin-name>-eval.html and -eval-all-summary.md


tc-eval — Run hand-authored test cases

An alternative to eval-generate + eval-all when you want full control over the prompts and expected outcomes. Point it at an evals.json file and it runs the same dual-arm engine (with skill vs without), grading each case and producing the same HTML report.

agentsproof tc-eval path/to/evals.json \
  --trials 2 \
  --model gpt-4o \
  [--smoke | --reliable | --regression] \
  [--ci] [--output runs/]

| Option | Default | Description |
|--------|---------|-------------|
| --trials N | 1 | Trials per arm per case |
| --smoke | — | 5 trials per arm |
| --reliable | — | 15 trials |
| --regression | — | 30 trials |
| --model | AGENTSPROOF_MODEL env | LLM model |
| --output | runs/ | Output directory |
| --ci | false | Exit 1 if any case below threshold |

Output: runs/<evals-json-stem>/<stem>-tc-eval.html

evals.json format

{
  "skill_name": "path/to/SKILL.md",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["data/sales_2025.csv"],
      "assertions": [
        "The output includes a bar chart image file",
        "The chart shows exactly 3 months",
        "Both axes are labeled",
        "The chart title or caption mentions revenue"
      ]
    },
    {
      "id": 2,
      "prompt": "Write a test that checks a search box returns results when 'laptop' is typed.",
      "expected_output": "A test that types into a search input and asserts results are visible."
    },
    {
      "id": 3,
      "prompt": "Create a smoke test that just loads the homepage and checks the title.",
      "assertions": [
        "The response includes code that navigates to a URL",
        "The response checks the page title"
      ]
    }
  ]
}

| Field | Required | Description |
|-------|----------|-------------|
| skill_name | yes | Path to the skill .md file (relative to the JSON file) |
| evals[].id | yes | Unique integer identifier for the case |
| evals[].prompt | yes | The user prompt sent to the LLM |
| evals[].expected_output | no | Natural-language description of the expected output, graded by an LLM rubric (weight 0.3 when combined with assertions) |
| evals[].assertions | no | Natural-language statements about what the response should contain or do; each is evaluated independently by an LLM judge (weight 0.7 when combined with expected_output) |
| evals[].files | no | File paths (relative to the JSON file) whose contents are injected into the prompt as context |

Grader weight rules:

| Graders present | Assertion weight | Rubric weight |
|-----------------|------------------|---------------|
| Both assertions and expected_output | 0.7 | 0.3 |
| Only assertions | 1.0 | — |
| Only expected_output | — | 1.0 |
| Neither | — | — (score is always 1.0) |
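The weight rules collapse into a small combination function. A sketch, assuming both inputs are scores in [0, 1] (this is not the repo's actual code):

```python
def combined_score(assertion_pass_rate=None, rubric_score=None):
    """Combine the assertion and rubric graders per the weight table (sketch)."""
    if assertion_pass_rate is not None and rubric_score is not None:
        return 0.7 * assertion_pass_rate + 0.3 * rubric_score
    if assertion_pass_rate is not None:
        return assertion_pass_rate   # only assertions: weight 1.0
    if rubric_score is not None:
        return rubric_score          # only expected_output: weight 1.0
    return 1.0                       # neither grader: score is always 1.0
```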

How assertion grading works

Each assertion is a plain-English claim about the response. The LLM judge evaluates every assertion independently and returns a pass/fail verdict with a brief evidence quote. Each assertion gets its own fresh LLM session to ensure reliable evaluation.

Example assertion result:

{
  "assertion_results": [
    {
      "text": "The output includes a bar chart image file",
      "passed": true,
      "evidence": "Found chart.png (45KB) in outputs directory"
    },
    {
      "text": "The chart shows exactly 3 months",
      "passed": true,
      "evidence": "Chart displays bars for March, July, and November"
    },
    {
      "text": "Both axes are labeled",
      "passed": false,
      "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
    },
    {
      "text": "The chart title or caption mentions revenue",
      "passed": true,
      "evidence": "Chart title reads 'Top 3 Months by Revenue'"
    }
  ],
  "summary": { "passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75 }
}

Write assertions that describe observable outcomes, not implementation details. Good assertions are specific enough to distinguish a correct response from a generic one — a baseline model without the skill should fail them.
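The per-assertion judging loop described above can be sketched as follows. The prompt wording and the PASS/FAIL verdict format are assumptions for illustration; `call_judge` stands in for a fresh LLM session per assertion.

```python
def judge_assertions(response, assertions, call_judge):
    """Evaluate each assertion in its own judge call (sketch only)."""
    results = []
    for text in assertions:
        # Hypothetical judge prompt; the real wording is not documented here.
        verdict = call_judge(
            f"Response:\n{response}\n\n"
            f"Claim: {text}\nAnswer PASS or FAIL with a short evidence quote."
        )
        results.append({"text": text, "passed": verdict.strip().startswith("PASS")})
    passed = sum(r["passed"] for r in results)
    total = len(results)
    return {
        "assertion_results": results,
        "summary": {"passed": passed, "failed": total - passed,
                    "total": total, "pass_rate": passed / total if total else 1.0},
    }
```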


eval.yaml format

Skills use skill:, agents use agent: — everything else is identical.

version: "2"
skill: SKILL.md        # for skills   — OR —
# agent: my-agent.md  # for agents

defaults:
  trials: 1            # trials per arm (× 2 total: with + without)
  timeout: 300         # seconds per trial
  threshold: 0.6       # pass rate required in the with-skill arm

task:
  name: my-task-name
  instruction: |
    The natural language prompt sent to the agent.
    Describe a realistic scenario — never mention the skill's techniques.
  graders:
    # Deterministic: fast string-match checks
    - type: deterministic
      checks:
        mustContain:
          - "expected keyword"
        mustNotContain:
          - "forbidden string"
        regexMatch:
          - "pattern-\\d+"
      weight: 0.7

    # LLM rubric: qualitative scoring
    - type: llm_rubric
      rubric: |
        Correctness (0–0.5): Did the response address the core problem?
        Completeness (0–0.5): Were all required steps covered?
      weight: 0.3

Key design rule: The instruction should describe only the problem, never the solution technique. A baseline model should struggle; the skill/agent should make it succeed.

Backward compat: v1 files with tasks: (list) are accepted — the first task is used.


HTML report

Each eval-all run writes one self-contained HTML file. Key sections:

  • Header — plugin name, date, item count, avg delta, overall verdict
  • Skills summary — stats grid + table (total/passed/failed/avg delta) for skills
  • Agents summary — same for agents (only shown if plugin has agents)
  • Drill-down — collapsible section per skill/agent:
    • Score/time/token stats for both arms
    • Per-trial breakdown with grader scores and check results
    • Improvement suggestions (grades A/B/C): which checks still fail, how to tighten the eval
    • Session log (expandable)

Executive summary

Executive summary — Skills and Agents sections with grade tables

Skill drill-down

Skill drill-down — improvement suggestions and per-arm stats

Skill drill-down — WITH / WITHOUT trial cards with grader scores


Graders

| Type | How it works | Best for |
|------|--------------|----------|
| deterministic | String presence/absence plus regex; instant | Required keywords, syntax, structure |
| llm_rubric | LLM grades the response 0–1 per rubric criterion | Quality, reasoning, completeness |

Each grader contributes to the trial score in proportion to its weight; the final trial score is the weighted average across all graders.

Shorthand: checks: ["item1", "item2"] is equivalent to mustContain: ["item1", "item2"].
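A deterministic grader of this shape fits in a few lines. This is an illustrative sketch, not the implementation in `pipeline/grader/deterministic.py`:

```python
import re

def deterministic_score(response, must_contain=(), must_not_contain=(), regex_match=()):
    """Return the fraction of checks that pass (sketch of the deterministic grader)."""
    checks = (
        [s in response for s in must_contain]
        + [s not in response for s in must_not_contain]
        + [re.search(p, response) is not None for p in regex_match]
    )
    # With no checks configured, there is nothing to fail.
    return sum(checks) / len(checks) if checks else 1.0
```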


Environment variables

| Variable | Description |
|----------|-------------|
| AGENTSPROOF_MODEL | Default LLM model (overridden by --model) |
| AGENTSPROOF_PROVIDER_TYPE | BYOK provider: openai, azure, or ollama |
| AGENTSPROOF_PROVIDER_URL | BYOK base URL |
| AGENTSPROOF_PROVIDER_KEY | BYOK API key |

When no BYOK variables are set, AgentsProof uses the GitHub Copilot SDK (requires VS Code with Copilot installed and active).
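For example, to run against an OpenAI-compatible endpoint (the URL and key below are placeholders, not values from this repo):

```shell
# BYOK configuration: endpoint and key values are placeholders.
export AGENTSPROOF_PROVIDER_TYPE=openai
export AGENTSPROOF_PROVIDER_URL=https://api.openai.com/v1
export AGENTSPROOF_PROVIDER_KEY=<your-api-key>
export AGENTSPROOF_MODEL=gpt-4o

agentsproof eval-all evals/my-plugin/ --smoke --ci
```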


Repository layout

agentsproof/
  cli.py               # Click CLI — eval, eval-generate, eval-all, tc-eval
  config.py            # Pydantic models: SkillEvalConfig, TaskConfig, GraderConfig
  report.py            # HTML / Markdown / JSON report rendering
  generator/
    core.py            # LLM-based eval.yaml generation (skills + agents)
  executor/
    runner.py          # Dual-arm runner: N trials with-skill vs N without
  llm/
    client.py          # LLMClient (Copilot SDK + BYOK providers)
  pipeline/
    grader/            # deterministic.py, llm_rubric.py, skillgrade.py
  schemas/
    artifacts.py       # SkillEvalResult, ArmResult, TrialResult, compute_grade()
    tc_eval.py         # TCEvalConfig, TCEvalCase — evals.json schema

evals/                 # Evaluation definitions (committed)
  <plugin-name>/
    skills/<skill-name>/
      eval.yaml
      SKILL.md
    agents/<agent-name>/
      eval.yaml
      <agent-name>.md

runs/                  # Generated output (gitignored)
  <plugin-name>/
    <plugin-name>-eval.html        ← eval-all report
    eval-all-summary.md
    eval-all-summary.json
  <evals-json-stem>/
    <stem>-tc-eval.html            ← tc-eval report

skills/
  suite-generator/
    SKILL.md           # System prompt for eval-generate

tests/                 # pytest suite (45 tests)

Running tests

python -m pytest tests/ -q

Exit codes

| Code | Meaning |
|------|---------|
| 0 | All evals passed |
| 1 | One or more evals failed the threshold (with --ci) |
| 30 | Error: missing files, SDK unavailable, etc. |

Contributing

See CONTRIBUTING.md.


License

MIT © 2025 Yoav Lax


Author

Built by Yoav Lax (@YoavLax).
