
feat(skills): add agent-eval for head-to-head coding agent comparison#540

Open
joaquinhuigomez wants to merge 3 commits into affaan-m:main from joaquinhuigomez:feat/agent-eval-skill

Conversation

@joaquinhuigomez

@joaquinhuigomez joaquinhuigomez commented Mar 16, 2026

Summary

Adds agent-eval as a skill — a lightweight CLI tool for comparing coding agents (Claude Code, Aider, Codex, etc.) head-to-head on custom tasks using YAML task definitions, git worktree isolation, and deterministic + model-based judges.

Type

  • Skill
  • Agent
  • Hook
  • Command

What it covers

  • YAML task definitions with judge criteria (pytest, grep, LLM-as-judge)
  • Git worktree isolation per run (no Docker needed)
  • Metrics: pass rate, cost, time, consistency (pass@k)
  • Comparison report generation

Testing

  • Tested with Claude Code locally
  • 39 tests passing in the agent-eval repo
  • Follows SKILL.md template format

Checklist

  • Follows format guidelines
  • Tested with Claude Code
  • No sensitive info (API keys, paths)
  • Clear descriptions

Summary by cubic

Adds the agent-eval skill, a lightweight CLI to compare coding agents on reproducible tasks and report pass rate, cost, time, and consistency. Helps teams make data-backed choices and run regressions on their own codebases.

  • New Features
    • Adds skills/agent-eval/SKILL.md with activation steps, pinned pip install to a commit, workflow, and examples.
    • Declarative task YAML with judges (tests/commands, pattern checks, model-based review) and a commit pin for reproducibility.
    • Per-run git worktree isolation for reproducibility (no Docker).
    • Comparison reports with multi-run metrics: pass rate across repeated runs, cost, and time; plus doc polish per review (removed duplicate section, set origin to ECC).

Written for commit 82e304c. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide for the agent-eval skill describing a head-to-head agent evaluation workflow.
    • Includes activation and installation steps, task definition formats, isolation and metrics concepts, step-by-step workflow for running and comparing agents, judge types (code-, pattern-, and model-based), best practices, and repository/configuration examples.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19578590-7658-4df5-b942-7516fafe9235

📥 Commits

Reviewing files that changed from the base of the PR and between 79e63b2 and 82e304c.

📒 Files selected for processing (1)
  • skills/agent-eval/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/agent-eval/SKILL.md

📝 Walkthrough

Walkthrough

Adds a new Markdown documentation file that describes a head-to-head agent evaluation workflow, covering activation, installation, YAML task definitions, git worktree isolation, metrics, judge types, workflow steps, best practices, and example repository links. (48 words)

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Agent Evaluation Documentation: `skills/agent-eval/SKILL.md` | New comprehensive guide for head‑to‑head agent evaluation: activation/installation, YAML task definitions, git worktree isolation, metrics (pass rate, cost, time, consistency), judge types (code/pattern/model), workflow steps, best practices, and example links. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • affaan-m

Poem

🐰 I hopped a doc into the glen,
Tasks in YAML, worktrees then,
Judges tally, metrics gleam,
Agents race in a testing dream —
I nibble carrots, cheer the team! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding agent-eval as a new skill for comparing coding agents head-to-head. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@greptile-apps

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR adds skills/agent-eval/SKILL.md, a new skill document that guides users through installing and operating the agent-eval CLI — a lightweight tool for comparing coding agents (Claude Code, Aider, Codex, etc.) on custom YAML-defined tasks with git worktree isolation and deterministic + model-based judges.

All issues raised in the previous review round have been resolved: origin is now ECC, an ## Installation section is present, the canonical YAML example includes the commit pin, the duplicate "When to Use" section has been removed, and the install command is pinned to a specific commit SHA.

Remaining notes:

  • The install comment labels a raw commit SHA as v0.1.0, which is misleading if no actual v0.1.0 git tag exists on the upstream repo — using @v0.1.0 (or correcting the label) would be clearer.
  • The llm judge section shows the YAML syntax but omits any information about which LLM provider is used, how to set API keys, or how to select the model, leaving users without enough context to actually use that judge type.

Confidence Score: 4/5

  • Documentation-only PR that is safe to merge; two minor style gaps remain but do not block usage.
  • All blocking issues from the prior review round have been addressed. The two remaining points (misleading version label and undocumented LLM judge configuration) are non-critical style/documentation gaps that do not affect correctness or safety.
  • No files require special attention beyond the two inline comments already noted.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `skills/agent-eval/SKILL.md` | New skill documentation for the agent-eval CLI. All previous review concerns (origin, installation, commit pin, duplicate section) have been addressed. Two minor style gaps remain: the version label in the install command is potentially misleading (SHA labelled as v0.1.0 rather than using an actual tag), and the LLM-as-judge section omits configuration details (provider, API key, model selection). |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as agent-eval CLI
    participant Worktree as Git Worktree
    participant Agent as Coding Agent
    participant Judge as Judge (pytest/grep/llm)
    participant Report as Report Generator

    User->>CLI: agent-eval run --task task.yaml --agent claude-code --runs 3
    loop For each run (1..k)
        CLI->>Worktree: Create fresh worktree from pinned commit
        CLI->>Agent: Hand off task prompt + files
        Agent-->>Worktree: Write code changes
        CLI->>Judge: Run judge criteria (pytest / grep / llm)
        Judge-->>CLI: pass/fail, cost, time
        CLI->>Worktree: Tear down worktree
    end
    CLI-->>User: Results stored

    User->>CLI: agent-eval report --format table
    CLI->>Report: Aggregate pass rate, cost, time, consistency
    Report-->>User: Comparison table (pass@k per agent)
```

Last reviewed commit: 82e304c


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/agent-eval/SKILL.md`:
- Around line 55-147: Add explicit "How It Works" and "Examples" headings to
match the skill format: under the existing "Workflow" content insert a "How It
Works" heading that briefly summarizes the three-step flow (Define Tasks, Run
Agents, Compare Results) and references the steps already shown, and create an
"Examples" heading that contains the concrete CLI examples (the agent-eval run
command and the agent-eval report command plus the ASCII table output) so
readers can find runnable examples; update the document near the "Workflow" and
before "Judge Types"/"Best Practices" and ensure the headings are exactly "How
It Works" and "Examples" to satisfy the guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32be15e6-66d7-4636-b9bc-a8c5e84f65f3

📥 Commits

Reviewing files that changed from the base of the PR and between 7cf07ca and c7f0def.

📒 Files selected for processing (1)
  • skills/agent-eval/SKILL.md

Comment on lines +55 to +147
## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```
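The "template above" is not included in this excerpt; a hypothetical task definition, consistent with the judge types documented below (field names are illustrative, not the tool's confirmed schema), might look like:

```yaml
# tasks/add-retry-logic.yaml -- illustrative sketch; field names are assumptions
name: add-retry-logic
commit: <commit-sha>   # pin the base commit for reproducible runs
prompt: |
  Add retry logic with exponential backoff to the HTTP client.
judge:
  - type: pytest
    command: pytest tests/test_retry.py -v
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```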

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:
1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time
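Conceptually, the isolation in steps 1 and 4 amounts to a `git worktree add`/`remove` pair pinned to a commit. A minimal sketch of the commands a runner might issue (not agent-eval's actual implementation):

```python
# Hypothetical sketch of per-run worktree isolation (steps 1 and 4) -- the
# command strings a runner might execute; not agent-eval's actual code.
import shlex

def worktree_setup_cmd(repo_dir: str, run_id: str, commit: str) -> str:
    """Create a fresh, detached worktree for one evaluation run."""
    wt = f"/tmp/agent-eval/{run_id}"
    return (f"git -C {shlex.quote(repo_dir)} worktree add --detach "
            f"{shlex.quote(wt)} {commit}")

def worktree_teardown_cmds(repo_dir: str, run_id: str) -> list[str]:
    """Remove the worktree and prune stale metadata after judging."""
    wt = f"/tmp/agent-eval/{run_id}"
    return [
        f"git -C {shlex.quote(repo_dir)} worktree remove --force {shlex.quote(wt)}",
        f"git -C {shlex.quote(repo_dir)} worktree prune",
    ]

print(worktree_setup_cmd(".", "run-1", "abc123"))
# git -C . worktree add --detach /tmp/agent-eval/run-1 abc123
```

Because each worktree starts detached at the pinned commit, agent edits never touch the main checkout and every run starts from an identical tree.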

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌─────────────┬───────────┬───────┬──────┬─────────────┐
│ Agent       │ Pass Rate │ Cost  │ Time │ Consistency │
├─────────────┼───────────┼───────┼──────┼─────────────┤
│ claude-code │ 3/3       │ $0.12 │ 45s  │ pass@3 100% │
│ aider       │ 2/3       │ $0.08 │ 38s  │ pass@3 67%  │
└─────────────┴───────────┴───────┴──────┴─────────────┘
```
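The per-agent numbers in a table like this are simple aggregates over run records. A hypothetical sketch of that aggregation (not agent-eval's code; field names are assumptions):

```python
# Illustrative aggregation of per-run records into the table's columns --
# RunResult and summarize are hypothetical names, not agent-eval's API.
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    cost_usd: float
    seconds: float

def summarize(runs: list[RunResult]) -> dict:
    """Collapse one agent's repeated runs into pass rate, cost, and time."""
    n = len(runs)
    passes = sum(r.passed for r in runs)
    return {
        "pass_rate": f"{passes}/{n}",
        "avg_cost": round(sum(r.cost_usd for r in runs) / n, 2),
        "avg_time": round(sum(r.seconds for r in runs) / n, 1),
        "consistency_pct": round(100 * passes / n),
    }

aider = [RunResult(True, 0.08, 38), RunResult(True, 0.09, 40), RunResult(False, 0.07, 36)]
print(summarize(aider))
# {'pass_rate': '2/3', 'avg_cost': 0.08, 'avg_time': 38.0, 'consistency_pct': 67}
```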

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
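A pattern judge of this kind can be approximated in a few lines of Python. This is a sketch under the assumption that the judge passes when at least one file matching the glob contains the regex; it is not the tool's actual implementation:

```python
# Hypothetical grep-judge sketch: pass when any file matching the glob
# contains the regex. Assumed semantics, not agent-eval's real code.
import re
from pathlib import Path

def grep_judge(root: str, pattern: str, files_glob: str) -> bool:
    """Return True if `pattern` matches in any file under `root` matching the glob."""
    rx = re.compile(pattern)
    return any(
        rx.search(p.read_text(errors="ignore")) is not None
        for p in Path(root).glob(files_glob)
        if p.is_file()
    )
```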

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions** — they are test fixtures, treat them as code

## When to Use

- Evaluating whether to switch from one coding agent to another
- Benchmarking a new model release against your baseline
- Building internal evidence for tool adoption decisions
- Running periodic regression checks on agent quality

## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)


⚠️ Potential issue | 🟡 Minor

Add explicit How It Works and Examples section headings to match skill format requirements.

The content already covers both, but the required section titles are not explicitly present.

✏️ Suggested doc-structure tweak
```diff
-## Workflow
+## How It Works
 ...
-## Judge Types
+## Examples
+
+### Judge Types
 ...
+### CLI Usage Examples
```

As per coding guidelines, "Skills should be formatted as Markdown with clear sections for When to Use, How It Works, and Examples."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/agent-eval/SKILL.md` around lines 55 - 147, Add explicit "How It
Works" and "Examples" headings to match the skill format: under the existing
"Workflow" content insert a "How It Works" heading that briefly summarizes the
three-step flow (Define Tasks, Run Agents, Compare Results) and references the
steps already shown, and create an "Examples" heading that contains the concrete
CLI examples (the agent-eval run command and the agent-eval report command plus
the ASCII table output) so readers can find runnable examples; update the
document near the "Workflow" and before "Judge Types"/"Best Practices" and
ensure the headings are exactly "How It Works" and "Examples" to satisfy the
guideline.


@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:4">
P2: Use the repository’s standard `origin` value for skills to avoid downstream tooling misclassification.</violation>

<violation number="2" location="skills/agent-eval/SKILL.md:28">
P3: Include a `commit` field in the canonical task YAML example to align with the reproducibility guidance; otherwise users are likely to run non-reproducible evaluations from the default example.</violation>

<violation number="3" location="skills/agent-eval/SKILL.md:44">
P2: The documentation overstates git worktree isolation as a security boundary, which can mislead users into unsafe assumptions.</violation>

<violation number="4" location="skills/agent-eval/SKILL.md:53">
P2: The document mislabels a per-run pass fraction as pass@k, producing mathematically incorrect metric semantics.</violation>

<violation number="5" location="skills/agent-eval/SKILL.md:147">
P1: User-facing skill links to an unvetted external repository, creating supply-chain and review-boundary risk.</violation>
</file>
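Violation 4 turns on the difference between a per-run pass fraction and pass@k. A short illustration using the standard unbiased pass@k estimator (general background, not code from this PR):

```python
# Pass fraction vs. pass@k: with n runs and c passes, pass@k estimates the
# probability that at least one of k sampled runs passes.
from math import comb

def pass_fraction(n: int, c: int) -> float:
    """Share of runs that passed -- what a '2/3' table cell reports."""
    return c / n

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to draw a k-sample that all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 of 3 runs passing, the per-run fraction is 2/3, but pass@3 is 1.0 --
# so labelling a 67% cell "pass@3" conflates the two metrics.
print(pass_fraction(3, 2))  # 0.6666666666666666
print(pass_at_k(3, 2, 3))   # 1.0
```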

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…kill

- Remove duplicate "When to Use" section (kept "When to Activate")
- Add Installation section with pip install instructions
- Change origin from "community" to "ECC" per repo convention
- Add commit field to YAML task example for reproducibility
- Fix pass@k mislabeling to "pass rate across repeated runs"
- Soften worktree isolation language to "reproducibility isolation"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:22">
P1: Installation docs use an unpinned external GitHub VCS install, creating reproducibility and supply-chain risk.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Address PR review feedback: pin the VCS install to commit
6d062a2 to avoid supply-chain risk from unpinned external deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:23">
P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```

@cubic-dev-ai cubic-dev-ai bot Mar 16, 2026


P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/agent-eval/SKILL.md, line 23:

<comment>User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</comment>

<file context>
````diff
@@ -19,7 +19,8 @@ A lightweight CLI tool for comparing coding agents head-to-head on reproducible
 ```bash
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git
+# pinned to v0.1.0 — latest stable commit
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
````

</file context>




Comment on lines +22 to +23
```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading version label in install comment

The comment says # pinned to v0.1.0 — latest stable commit but the install command references a raw commit SHA rather than a v0.1.0 git tag. If a v0.1.0 tag exists in the upstream repo, using the named tag is more readable, more transparent, and lets users immediately see what version they're installing:

Suggested change
```diff
-# pinned to v0.1.0 — latest stable commit
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+# pinned to v0.1.0 — latest stable release
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@v0.1.0
```

If v0.1.0 is not an actual tag on the upstream repo and the SHA was simply labelled "v0.1.0" for convenience, the comment is misleading. Either push and reference a proper tag, or update the comment to clarify it is a specific commit (e.g. # pinned to commit 6d062a2).

