feat(skills): add agent-eval for head-to-head coding agent comparison #540

joaquinhuigomez wants to merge 3 commits into affaan-m:main from
Conversation
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Adds a new Markdown documentation file that describes a head-to-head agent evaluation workflow, covering activation, installation, YAML task definitions, git worktree isolation, metrics, judge types, workflow steps, best practices, and example repository links.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 3 passed
Greptile Summary

This PR adds

All issues raised in the previous review round have been resolved.

Confidence Score: 4/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as agent-eval CLI
    participant Worktree as Git Worktree
    participant Agent as Coding Agent
    participant Judge as Judge (pytest/grep/llm)
    participant Report as Report Generator
    User->>CLI: agent-eval run --task task.yaml --agent claude-code --runs 3
    loop For each run (1..k)
        CLI->>Worktree: Create fresh worktree from pinned commit
        CLI->>Agent: Hand off task prompt + files
        Agent-->>Worktree: Write code changes
        CLI->>Judge: Run judge criteria (pytest / grep / llm)
        Judge-->>CLI: pass/fail, cost, time
        CLI->>Worktree: Tear down worktree
    end
    CLI-->>User: Results stored
    User->>CLI: agent-eval report --format table
    CLI->>Report: Aggregate pass rate, cost, time, consistency
    Report-->>User: Comparison table (pass@k per agent)
```
Last reviewed commit: 82e304c
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/agent-eval/SKILL.md`:
- Around line 55-147: Add explicit "How It Works" and "Examples" headings to
match the skill format: under the existing "Workflow" content insert a "How It
Works" heading that briefly summarizes the three-step flow (Define Tasks, Run
Agents, Compare Results) and references the steps already shown, and create an
"Examples" heading that contains the concrete CLI examples (the agent-eval run
command and the agent-eval report command plus the ASCII table output) so
readers can find runnable examples; update the document near the "Workflow" and
before "Judge Types"/"Best Practices" and ensure the headings are exactly "How
It Works" and "Examples" to satisfy the guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 32be15e6-66d7-4636-b9bc-a8c5e84f65f3
📒 Files selected for processing (1)
skills/agent-eval/SKILL.md
## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```
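For orientation, a complete task file might look like the sketch below. The `judge` entries mirror the examples later in this document and the `commit` pin follows the reproducibility guidance in Best Practices; the remaining field names (`name`, `repo`, `prompt`) are illustrative assumptions, not agent-eval's confirmed schema:

```yaml
# tasks/add-retry-logic.yaml: hypothetical task definition
name: add-retry-logic
repo: .
commit: 6d062a2          # pin the starting commit so runs are reproducible
prompt: |
  Add retry logic with exponential backoff to the HTTP client.
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```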
### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time
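The worktree step above can be sketched with plain `git worktree` calls. This is an illustration of the isolation pattern only, not agent-eval's actual implementation, and the helper names are hypothetical:

```python
import subprocess
import tempfile
from pathlib import Path

def create_worktree(repo: Path, commit: str) -> Path:
    """Create a fresh, disposable worktree checked out at the pinned commit."""
    # Place the worktree inside a throwaway temp directory; the target path
    # must not exist yet, so git creates it for us.
    dest = Path(tempfile.mkdtemp(prefix="agent-eval-")) / "wt"
    # --detach checks out the commit directly without creating a branch
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), commit],
        check=True,
    )
    return dest

def remove_worktree(repo: Path, dest: Path) -> None:
    """Tear the worktree down so the next run starts from a clean state."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(dest)],
        check=True,
    )
```

Each run would call `create_worktree`, hand the resulting path to the agent, judge the result, then call `remove_worktree`, so no run sees another run's edits.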
### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ pass@3 100% │
│ aider        │ 2/3       │ $0.08  │ 38s    │ pass@3 67%  │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```
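A note on the Consistency column: the table shows a simple pass fraction, while "pass@k" in the literature usually means the unbiased estimator from the HumanEval/Codex evaluation, pass@k = 1 - C(n-c, k)/C(n, k) for c passes out of n runs. A minimal sketch of that estimator (not necessarily what agent-eval computes):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k runs, drawn
    without replacement from n runs of which c passed, is a pass."""
    if n - c < k:
        # Fewer failures than draws: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = c = 3 and k = 3 this gives 1.0, matching the "pass@3 100%" row; with n = 3, c = 2 it equals the plain pass fraction only when k = 1.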
## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
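The semantics of a command judge are simple: run the command in the worktree and treat a zero exit status as a pass. A hedged sketch of that contract (agent-eval's real judge runner may record more detail):

```python
import shlex
import subprocess
import time

def run_command_judge(command: str, cwd: str) -> dict:
    """Deterministic judge: the command's exit status decides pass/fail."""
    start = time.monotonic()
    result = subprocess.run(
        shlex.split(command),  # e.g. "pytest tests/ -v" or "npm run build"
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return {
        "passed": result.returncode == 0,
        "seconds": time.monotonic() - start,
        "output": result.stdout + result.stderr,
    }
```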
### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
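A grep judge of this shape can be approximated in a few lines with `pathlib` globbing plus a regex; this sketch passes when any matched file contains the pattern, which is one plausible reading of the YAML above rather than agent-eval's documented behavior:

```python
import re
from pathlib import Path

def run_grep_judge(pattern: str, files_glob: str, root: str = ".") -> bool:
    """Pattern judge: pass if any file matched by the glob contains the regex."""
    regex = re.compile(pattern)
    return any(
        regex.search(path.read_text(errors="ignore"))
        for path in Path(root).glob(files_glob)
        if path.is_file()
    )
```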
### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
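A model-based judge boils down to sending the prompt plus the agent's changes to a model and parsing a verdict. The sketch below keeps the model call abstract (`ask_model` is an injected callable standing in for whatever LLM client you use); the wiring and verdict parsing are assumptions, not agent-eval's API:

```python
from typing import Callable

def run_llm_judge(prompt: str, diff: str, ask_model: Callable[[str], str]) -> bool:
    """LLM-as-judge sketch: pass when the model's answer starts with 'yes'."""
    verdict = ask_model(
        f"{prompt}\n\nCode under review:\n{diff}\n\nAnswer yes or no first."
    )
    # Crude but deterministic parsing; real judges often request structured output.
    return verdict.strip().lower().startswith("yes")
```

Because the model call is injected, the judge itself stays testable with a stub, which is also why the doc recommends pairing LLM judges with at least one deterministic judge.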
## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions** — they are test fixtures, treat them as code
## When to Use

- Evaluating whether to switch from one coding agent to another
- Benchmarking a new model release against your baseline
- Building internal evidence for tool adoption decisions
- Running periodic regression checks on agent quality
## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
Add explicit How It Works and Examples section headings to match skill format requirements.
The content already covers both, but the required section titles are not explicitly present.

✏️ Suggested doc-structure tweak

```diff
-## Workflow
+## How It Works
...
-## Judge Types
+## Examples
+
+### Judge Types
...
+### CLI Usage Examples
```

As per coding guidelines, "Skills should be formatted as Markdown with clear sections for When to Use, How It Works, and Examples."
5 issues found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/agent-eval/SKILL.md">
<violation number="1" location="skills/agent-eval/SKILL.md:4">
P2: Use the repository’s standard `origin` value for skills to avoid downstream tooling misclassification.</violation>
<violation number="2" location="skills/agent-eval/SKILL.md:28">
P3: Include a `commit` field in the canonical task YAML example to align with the reproducibility guidance; otherwise users are likely to run non-reproducible evaluations from the default example.</violation>
<violation number="3" location="skills/agent-eval/SKILL.md:44">
P2: The documentation overstates git worktree isolation as a security boundary, which can mislead users into unsafe assumptions.</violation>
<violation number="4" location="skills/agent-eval/SKILL.md:53">
P2: The document mislabels a per-run pass fraction as pass@k, producing mathematically incorrect metric semantics.</violation>
<violation number="5" location="skills/agent-eval/SKILL.md:147">
P1: User-facing skill links to an unvetted external repository, creating supply-chain and review-boundary risk.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…kill

- Remove duplicate "When to Use" section (kept "When to Activate")
- Add Installation section with pip install instructions
- Change origin from "community" to "ECC" per repo convention
- Add commit field to YAML task example for reproducibility
- Fix pass@k mislabeling to "pass rate across repeated runs"
- Soften worktree isolation language to "reproducibility isolation"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/agent-eval/SKILL.md">
<violation number="1" location="skills/agent-eval/SKILL.md:22">
P1: Installation docs use an unpinned external GitHub VCS install, creating reproducibility and supply-chain risk.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Address PR review feedback: pin the VCS install to commit 6d062a2 to avoid supply-chain risk from unpinned external deps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/agent-eval/SKILL.md">
<violation number="1" location="skills/agent-eval/SKILL.md:23">
P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```
P1: User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/agent-eval/SKILL.md, line 23:
<comment>User-facing docs instruct installing executable code directly from an external GitHub repo via pip VCS URL, creating avoidable supply-chain and trust-boundary risk.</comment>
File context:

````diff
@@ -19,7 +19,8 @@ A lightweight CLI tool for comparing coding agents head-to-head on reproducible
 ```bash
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git
+# pinned to v0.1.0 — latest stable commit
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
````
```bash
# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```

Misleading version label in install comment

The comment says `# pinned to v0.1.0 — latest stable commit` but the install command references a raw commit SHA rather than a v0.1.0 git tag. If a v0.1.0 tag exists in the upstream repo, using the named tag is more readable, more transparent, and lets users immediately see what version they're installing:

```diff
-# pinned to v0.1.0 — latest stable commit
-pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
+# pinned to v0.1.0 — latest stable release
+pip install git+https://github.com/joaquinhuigomez/agent-eval.git@v0.1.0
```

If v0.1.0 is not an actual tag on the upstream repo and the SHA was simply labelled "v0.1.0" for convenience, the comment is misleading. Either push and reference a proper tag, or update the comment to clarify it is a specific commit (e.g. `# pinned to commit 6d062a2`).
Summary

Adds `agent-eval` as a skill — a lightweight CLI tool for comparing coding agents (Claude Code, Aider, Codex, etc.) head-to-head on custom tasks using YAML task definitions, git worktree isolation, and deterministic + model-based judges.
Summary by cubic

Adds the `agent-eval` skill, a lightweight CLI to compare coding agents on reproducible tasks and report pass rate, cost, time, and consistency. Helps teams make data-backed choices and run regressions on their own codebases.

- `skills/agent-eval/SKILL.md` with activation steps, pinned `pip` install to a commit, workflow, and examples.
- `commit` pin for reproducibility.
- Origin changed to "ECC" per repo convention.

Written for commit 82e304c. Summary will update on new commits.