
feat(skills): add agent-eval for head-to-head coding agent comparison#540

Open
joaquinhuigomez wants to merge 3 commits into affaan-m:main from joaquinhuigomez:feat/agent-eval-skill

Conversation

@joaquinhuigomez

@joaquinhuigomez joaquinhuigomez commented Mar 16, 2026

Summary

Adds agent-eval as a skill — a lightweight CLI tool for comparing coding agents (Claude Code, Aider, Codex, etc.) head-to-head on custom tasks using YAML task definitions, git worktree isolation, and deterministic + model-based judges.

Type

  • Skill
  • Agent
  • Hook
  • Command

What it covers

  • YAML task definitions with judge criteria (pytest, grep, LLM-as-judge)
  • Git worktree isolation per run (no Docker needed)
  • Metrics: pass rate, cost, time, consistency (pass@k)
  • Comparison report generation

Testing

  • Tested with Claude Code locally
  • 39 tests passing in the agent-eval repo
  • Follows SKILL.md template format

Checklist

  • Follows format guidelines
  • Tested with Claude Code
  • No sensitive info (API keys, paths)
  • Clear descriptions

Summary by cubic

Adds the agent-eval skill, a lightweight CLI to compare coding agents on reproducible tasks and report pass rate, cost, time, and consistency. Helps teams make data-backed choices and run regressions on their own codebases.

  • New Features
    • Adds skills/agent-eval/SKILL.md with activation steps, pinned pip install to a commit, workflow, and examples.
    • Declarative task YAML with judges (tests/commands, pattern checks, model-based review) and a commit pin for reproducibility.
    • Per-run git worktree isolation for reproducibility (no Docker).
    • Comparison reports with multi-run metrics: pass rate across repeated runs, cost, and time; plus doc polish per review (removed duplicate section, set origin to ECC).

Written for commit 82e304c. Summary will update on new commits.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide for the agent-eval skill describing a head-to-head agent evaluation workflow.
    • Includes activation and installation steps, task definition formats, isolation and metrics concepts, step-by-step workflow for running and comparing agents, judge types (code-, pattern-, and model-based), best practices, and repository/configuration examples.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19578590-7658-4df5-b942-7516fafe9235

📥 Commits

Reviewing files that changed from the base of the PR and between 79e63b2 and 82e304c.

📒 Files selected for processing (1)
  • skills/agent-eval/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/agent-eval/SKILL.md

📝 Walkthrough

Walkthrough

Adds a new Markdown documentation file that describes a head-to-head agent evaluation workflow, covering activation, installation, YAML task definitions, git worktree isolation, metrics, judge types, workflow steps, best practices, and example repository links. (48 words)

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Agent Evaluation Documentation: `skills/agent-eval/SKILL.md` | New comprehensive guide for head‑to‑head agent evaluation: activation/installation, YAML task definitions, git worktree isolation, metrics (pass rate, cost, time, consistency), judge types (code/pattern/model), workflow steps, best practices, and example links. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • affaan-m

Poem

🐰 I hopped a doc into the glen,
Tasks in YAML, worktrees then,
Judges tally, metrics gleam,
Agents race in a testing dream —
I nibble carrots, cheer the team! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding agent-eval as a new skill for comparing coding agents head-to-head. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@greptile-apps

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR adds skills/agent-eval/SKILL.md, a new skill document that guides users through installing and operating the agent-eval CLI — a lightweight tool for comparing coding agents (Claude Code, Aider, Codex, etc.) on custom YAML-defined tasks with git worktree isolation and deterministic + model-based judges.

All issues raised in the previous review round have been resolved: origin is now ECC, an ## Installation section is present, the canonical YAML example includes the commit pin, the duplicate "When to Use" section has been removed, and the install command is pinned to a specific commit SHA.

Remaining notes:

  • The install comment labels a raw commit SHA as v0.1.0, which is misleading if no actual v0.1.0 git tag exists on the upstream repo — using @v0.1.0 (or correcting the label) would be clearer.
  • The llm judge section shows the YAML syntax but omits any information about which LLM provider is used, how to set API keys, or how to select the model, leaving users without enough context to actually use that judge type.

Confidence Score: 4/5

  • Documentation-only PR that is safe to merge; two minor style gaps remain but do not block usage.
  • All blocking issues from the prior review round have been addressed. The two remaining points (misleading version label and undocumented LLM judge configuration) are non-critical style/documentation gaps that do not affect correctness or safety.
  • No files require special attention beyond the two inline comments already noted.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `skills/agent-eval/SKILL.md` | New skill documentation for the agent-eval CLI. All previous review concerns (origin, installation, commit pin, duplicate section) have been addressed. Two minor style gaps remain: the version label in the install command is potentially misleading (SHA labelled as v0.1.0 rather than using an actual tag), and the LLM-as-judge section omits configuration details (provider, API key, model selection). |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as agent-eval CLI
    participant Worktree as Git Worktree
    participant Agent as Coding Agent
    participant Judge as Judge (pytest/grep/llm)
    participant Report as Report Generator

    User->>CLI: agent-eval run --task task.yaml --agent claude-code --runs 3
    loop For each run (1..k)
        CLI->>Worktree: Create fresh worktree from pinned commit
        CLI->>Agent: Hand off task prompt + files
        Agent-->>Worktree: Write code changes
        CLI->>Judge: Run judge criteria (pytest / grep / llm)
        Judge-->>CLI: pass/fail, cost, time
        CLI->>Worktree: Tear down worktree
    end
    CLI-->>User: Results stored

    User->>CLI: agent-eval report --format table
    CLI->>Report: Aggregate pass rate, cost, time, consistency
    Report-->>User: Comparison table (pass@k per agent)
```

Last reviewed commit: 82e304c


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/agent-eval/SKILL.md`:
- Around line 55-147: Add explicit "How It Works" and "Examples" headings to
match the skill format: under the existing "Workflow" content insert a "How It
Works" heading that briefly summarizes the three-step flow (Define Tasks, Run
Agents, Compare Results) and references the steps already shown, and create an
"Examples" heading that contains the concrete CLI examples (the agent-eval run
command and the agent-eval report command plus the ASCII table output) so
readers can find runnable examples; update the document near the "Workflow" and
before "Judge Types"/"Best Practices" and ensure the headings are exactly "How
It Works" and "Examples" to satisfy the guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32be15e6-66d7-4636-b9bc-a8c5e84f65f3

📥 Commits

Reviewing files that changed from the base of the PR and between 7cf07ca and c7f0def.

📒 Files selected for processing (1)
  • skills/agent-eval/SKILL.md

Comment on lines +55 to +147
## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```
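The "template above" is not included in this excerpt; a hypothetical task definition, consistent with the judge types documented below (field names are illustrative, not the tool's confirmed schema), might look like:

```yaml
# tasks/add-retry-logic.yaml -- illustrative sketch; field names are assumptions
name: add-retry-logic
commit: <commit-sha>   # pin the base commit for reproducible runs
prompt: |
  Add retry logic with exponential backoff to the HTTP client.
judge:
  - type: pytest
    command: pytest tests/test_retry.py -v
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```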

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:
1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time
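Conceptually, the isolation in steps 1 and 4 amounts to a `git worktree add`/`remove` pair pinned to a commit. A minimal sketch of the commands a runner might issue (not agent-eval's actual implementation):

```python
# Hypothetical sketch of per-run worktree isolation (steps 1 and 4) -- the
# command strings a runner might execute; not agent-eval's actual code.
import shlex

def worktree_setup_cmd(repo_dir: str, run_id: str, commit: str) -> str:
    """Create a fresh, detached worktree for one evaluation run."""
    wt = f"/tmp/agent-eval/{run_id}"
    return (f"git -C {shlex.quote(repo_dir)} worktree add --detach "
            f"{shlex.quote(wt)} {commit}")

def worktree_teardown_cmds(repo_dir: str, run_id: str) -> list[str]:
    """Remove the worktree and prune stale metadata after judging."""
    wt = f"/tmp/agent-eval/{run_id}"
    return [
        f"git -C {shlex.quote(repo_dir)} worktree remove --force {shlex.quote(wt)}",
        f"git -C {shlex.quote(repo_dir)} worktree prune",
    ]

print(worktree_setup_cmd(".", "run-1", "abc123"))
# git -C . worktree add --detach /tmp/agent-eval/run-1 abc123
```

Because each worktree starts detached at the pinned commit, agent edits never touch the main checkout and every run starts from an identical tree.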

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌─────────────┬───────────┬───────┬──────┬─────────────┐
│ Agent       │ Pass Rate │ Cost  │ Time │ Consistency │
├─────────────┼───────────┼───────┼──────┼─────────────┤
│ claude-code │ 3/3       │ $0.12 │ 45s  │ pass@3 100% │
│ aider       │ 2/3       │ $0.08 │ 38s  │ pass@3 67%  │
└─────────────┴───────────┴───────┴──────┴─────────────┘
```
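The per-agent numbers in a table like this are simple aggregates over run records. A hypothetical sketch of that aggregation (not agent-eval's code; field names are assumptions):

```python
# Illustrative aggregation of per-run records into the table's columns --
# RunResult and summarize are hypothetical names, not agent-eval's API.
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    cost_usd: float
    seconds: float

def summarize(runs: list[RunResult]) -> dict:
    """Collapse one agent's repeated runs into pass rate, cost, and time."""
    n = len(runs)
    passes = sum(r.passed for r in runs)
    return {
        "pass_rate": f"{passes}/{n}",
        "avg_cost": round(sum(r.cost_usd for r in runs) / n, 2),
        "avg_time": round(sum(r.seconds for r in runs) / n, 1),
        "consistency_pct": round(100 * passes / n),
    }

aider = [RunResult(True, 0.08, 38), RunResult(True, 0.09, 40), RunResult(False, 0.07, 36)]
print(summarize(aider))
# {'pass_rate': '2/3', 'avg_cost': 0.08, 'avg_time': 38.0, 'consistency_pct': 67}
```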

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
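A pattern judge of this kind can be approximated in a few lines of Python. This is a sketch under the assumption that the judge passes when at least one file matching the glob contains the regex; it is not the tool's actual implementation:

```python
# Hypothetical grep-judge sketch: pass when any file matching the glob
# contains the regex. Assumed semantics, not agent-eval's real code.
import re
from pathlib import Path

def grep_judge(root: str, pattern: str, files_glob: str) -> bool:
    """Return True if `pattern` matches in any file under `root` matching the glob."""
    rx = re.compile(pattern)
    return any(
        rx.search(p.read_text(errors="ignore")) is not None
        for p in Path(root).glob(files_glob)
        if p.is_file()
    )
```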

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions** — they are test fixtures, treat them as code

## When to Use

- Evaluating whether to switch from one coding agent to another
- Benchmarking a new model release against your baseline
- Building internal evidence for tool adoption decisions
- Running periodic regression checks on agent quality

## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)


⚠️ Potential issue | 🟡 Minor

Add explicit How It Works and Examples section headings to match skill format requirements.

The content already covers both, but the required section titles are not explicitly present.

✏️ Suggested doc-structure tweak
```diff
-## Workflow
+## How It Works
 ...
-## Judge Types
+## Examples
+
+### Judge Types
 ...
+### CLI Usage Examples
```

As per coding guidelines, "Skills should be formatted as Markdown with clear sections for When to Use, How It Works, and Examples."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/agent-eval/SKILL.md` around lines 55 - 147, Add explicit "How It
Works" and "Examples" headings to match the skill format: under the existing
"Workflow" content insert a "How It Works" heading that briefly summarizes the
three-step flow (Define Tasks, Run Agents, Compare Results) and references the
steps already shown, and create an "Examples" heading that contains the concrete
CLI examples (the agent-eval run command and the agent-eval report command plus
the ASCII table output) so readers can find runnable examples; update the
document near the "Workflow" and before "Judge Types"/"Best Practices" and
ensure the headings are exactly "How It Works" and "Examples" to satisfy the
guideline.


@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:4">
P2: Use the repository’s standard `origin` value for skills to avoid downstream tooling misclassification.</violation>

<violation number="2" location="skills/agent-eval/SKILL.md:28">
P3: Include a `commit` field in the canonical task YAML example to align with the reproducibility guidance; otherwise users are likely to run non-reproducible evaluations from the default example.</violation>

<violation number="3" location="skills/agent-eval/SKILL.md:44">
P2: The documentation overstates git worktree isolation as a security boundary, which can mislead users into unsafe assumptions.</violation>

<violation number="4" location="skills/agent-eval/SKILL.md:53">
P2: The document mislabels a per-run pass fraction as pass@k, producing mathematically incorrect metric semantics.</violation>

<violation number="5" location="skills/agent-eval/SKILL.md:147">
P1: User-facing skill links to an unvetted external repository, creating supply-chain and review-boundary risk.</violation>
</file>
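Violation 4 turns on the difference between a per-run pass fraction and pass@k. A short illustration using the standard unbiased pass@k estimator (general background, not code from this PR):

```python
# Pass fraction vs. pass@k: with n runs and c passes, pass@k estimates the
# probability that at least one of k sampled runs passes.
from math import comb

def pass_fraction(n: int, c: int) -> float:
    """Share of runs that passed -- what a '2/3' table cell reports."""
    return c / n

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to draw a k-sample that all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 of 3 runs passing, the per-run fraction is 2/3, but pass@3 is 1.0 --
# so labelling a 67% cell "pass@3" conflates the two metrics.
print(pass_fraction(3, 2))  # 0.6666666666666666
print(pass_at_k(3, 2, 3))   # 1.0
```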

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…kill

- Remove duplicate "When to Use" section (kept "When to Activate")
- Add Installation section with pip install instructions
- Change origin from "community" to "ECC" per repo convention
- Add commit field to YAML task example for reproducibility
- Fix pass@k mislabeling to "pass rate across repeated runs"
- Soften worktree isolation language to "reproducibility isolation"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:22">
P1: Installation docs use an unpinned external GitHub VCS install, creating reproducibility and supply-chain risk.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Address PR review feedback: pin the VCS install to commit
6d062a2 to avoid supply-chain risk from unpinned external deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/agent-eval/SKILL.md">

<violation number="1" location="skills/agent-eval/SKILL.md:23">
P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```

@cubic-dev-ai cubic-dev-ai bot Mar 16, 2026


P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/agent-eval/SKILL.md, line 23:

<comment>User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</comment>

<file context>
````diff
@@ -19,7 +19,8 @@ A lightweight CLI tool for comparing coding agents head-to-head on reproducible
 ```bash
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git
+# pinned to v0.1.0 — latest stable commit
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
````

</file context>




Comment on lines +22 to +23
```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading version label in install comment

The comment says # pinned to v0.1.0 — latest stable commit but the install command references a raw commit SHA rather than a v0.1.0 git tag. If a v0.1.0 tag exists in the upstream repo, using the named tag is more readable, more transparent, and lets users immediately see what version they're installing:

Suggested change
```diff
-# pinned to v0.1.0 — latest stable commit
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+# pinned to v0.1.0 — latest stable release
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@v0.1.0
```

If v0.1.0 is not an actual tag on the upstream repo and the SHA was simply labelled "v0.1.0" for convenience, the comment is misleading. Either push and reference a proper tag, or update the comment to clarify it is a specific commit (e.g. # pinned to commit 6d062a2).

