Automated AI skill testing and improvement, optimized for token efficiency.
Your AI skill works most of the time. But some inputs fail unpredictably, and you don't know why. The usual fix is manual trial and error: tweak the prompt, try it out, hope it's better.
skill-optimizer replaces that guesswork with a measurable loop: define what "good output" looks like, run automated tests, get a pass rate, make one targeted change, see if it improves. Every decision is backed by a number, not a feeling.
The catch with existing tools is cost: re-running all test cases after every change, grading outputs one by one, passing full context to every agent call — it adds up fast. This project fixes that without cutting corners on quality.
| Strategy | Mechanism | Savings |
|---|---|---|
| Partial re-eval | After a change, re-run only failing cases first. Skip full suite if partial doesn't improve. | ~50% on re-tests |
| Batch grading | All outputs graded in one LLM call (matrix output), not N calls | ~(N-1)/N per eval round |
| Structured context | Improver receives JSON-schema failure metadata, not narrative history | ~60% on improver input |
| Diff format | Changes expressed as unified diffs, not repeated full file | ~70% on change context |
| Failure clustering | Group failures by root cause; fix the cause, not each symptom | Fewer iterations needed |
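For example, the batch grader's single-call matrix output might be shaped like this (an assumed shape for illustration; the real schema is defined in references/schemas.md):

```typescript
// Assumed shape of the grader's matrix output: one verdict per
// (test case, checklist item) pair, produced in a single LLM call
// instead of N separate grading calls.
interface GradingMatrix {
  caseIds: string[];
  checklistItems: string[];
  results: boolean[][]; // results[caseIdx][itemIdx]
}

// A case passes only if it satisfies every checklist item.
function passRate(m: GradingMatrix): number {
  const passed = m.results.filter((row) => row.every(Boolean)).length;
  return passed / m.caseIds.length;
}
```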
Context compression is only applied to metadata (failure clusters, changelog), never to skill content itself. The improver agent always receives the verbatim text of the skill section it needs to reason about. This is the safe boundary: compact schemas for structured data, full text for nuanced prose.
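That boundary can be sketched as a TypeScript shape (field names are assumptions for illustration, not the real schema; see references/schemas.md):

```typescript
// Illustrative shape of the improver's input: compact structured
// metadata for failure clusters, verbatim text for the skill section.
// These names are hypothetical, not the project's actual schema.
interface FailureCluster {
  rootCause: string;             // one-line hypothesis shared by the cluster
  failedCaseIds: string[];       // cases exhibiting this root cause
  failedChecklistItems: string[];
  skillSection: string;          // which section to load verbatim
}

interface ImproverInput {
  clusters: FailureCluster[];
  skillSectionText: string;      // full, uncompressed skill prose
  currentPassRate: number;
}
```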
Analysis mode: a static 6-dimension audit. No test cases needed. Cheap.
Tell Claude: "Analyze the [skill-name] skill"
Outputs a scored diagnostic report covering structural clarity, instruction completeness, edge case coverage, consistency, actionability, and quality assurance.
Optimization mode: an eval-driven loop with automatic improvements.
Tell Claude: "Optimize the [skill-name] skill"
Claude collects test cases and a checklist, runs a baseline, then iterates: analyze failures → propose one change → partial re-eval → full re-eval if partial improves → keep or revert.
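The keep-or-revert step at the end of each iteration can be sketched as follows (a schematic under assumed names, not the actual orchestration logic in SKILL.md):

```typescript
// Schematic of the partial-then-full evaluation gate. The partial
// re-eval re-runs only previously failing cases; the full suite runs
// only if at least one of them now passes.
function decide(
  newlyPassingInPartial: number, // failing cases that now pass
  baselinePassRate: number,
  runFullEval: () => number      // full-suite pass rate (costly)
): "keep" | "revert" {
  if (newlyPassingInPartial === 0) return "revert"; // skip the full suite
  const fullPassRate = runFullEval();
  return fullPassRate > baselinePassRate ? "keep" : "revert";
}
```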
Estimate cost before committing to a run:
```shell
npx ts-node src/cli/token-estimate.ts path/to/SKILL.md 5 4 10
```

```
Token Cost Estimate
══════════════════════════════════════════════════
Skill file:        820 tokens
Test cases:        5
Checklist items:   4
Max iterations:    10

Analysis only:     ~1,120 tokens

Optimization run:
  Baseline eval:   ~2,350 tokens
  Per iteration:   ~2,700 tokens
    partial eval:    ~940
    full eval×60%:   ~1,410
    improver:        ~350
  Total (est.):    ~29,350 tokens

Naive approach:    ~52,000 tokens
Estimated savings: ~44%

✓ Low token cost. Good to proceed.
```
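The totals in the sample report follow a simple sum, reconstructed here from the printed breakdown (the real estimator may weight terms differently):

```typescript
// Per-iteration cost is the sum of its parts, and the run total is the
// baseline eval plus max_iterations iterations. Figures taken from the
// sample estimate above.
const perIteration = 940 + 1410 + 350; // partial + full×60% + improver
const total = 2350 + perIteration * 10; // baseline + 10 iterations
// perIteration === 2700 and total === 29350, matching the report.
```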
Cache prompt outputs by content hash. If the skill and test case input haven't changed since last run, reuse the cached output:
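A minimal sketch of such a content-hash key, using Node's built-in crypto (hypothetical; the real cache.ts may derive its keys differently):

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache key: identical skill text + case input always
// produce the same key, so the cached output can be reused.
function cacheKey(skillText: string, caseInput: string): string {
  return createHash("sha256")
    .update(skillText)
    .update("\0") // separator so ("ab","c") and ("a","bc") differ
    .update(caseInput)
    .digest("hex");
}
```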
```shell
# Check cache before running a case
npx ts-node src/cli/cache.ts check SKILL.md case-2 .cache/

# Store output after running
npx ts-node src/cli/cache.ts store SKILL.md case-2 output.md .cache/

# View cache stats
npx ts-node src/cli/cache.ts stats .cache/

# Clear all cached entries
npx ts-node src/cli/cache.ts clear .cache/
```

Browse optimization runs interactively:

```shell
cd src/viewer
npm install
npm run dev
```

Drop a `changelog.json` from any `{name}-workspace/` directory to explore:
- Run summary (baseline → final pass rate, tokens spent)
- Pass-rate sparkline across iterations
- Per-iteration cards with verdict, delta, and change type
- Diff viewer for each iteration's proposed change
```
skill-optimizer/
├── SKILL.md                  Main orchestration skill (read by Claude)
├── agents/
│   ├── analyzer.md           6-dimension static analysis (single call)
│   ├── grader.md             Batch grading — all cases in one call
│   ├── improver.md           Failure analysis + unified diff proposal
│   ├── eval_designer.md      Checklist design with token-aware guidance
│   └── case_extractor.md     Test case extraction from history
├── references/
│   ├── eval_guide.md         How to write effective Yes/No checklists
│   └── schemas.md            JSON schemas for all output files
├── src/
│   ├── shared/
│   │   └── types.ts          Shared TypeScript types
│   ├── cli/
│   │   ├── token-estimate.ts Pre-run cost estimator
│   │   └── cache.ts          Hash-based output cache
│   └── viewer/
│       ├── App.tsx           React eval viewer
│       ├── main.tsx          Entry point
│       ├── index.html
│       ├── vite.config.ts
│       └── package.json
├── package.json
└── tsconfig.json
```
Set these when telling Claude to optimize a skill, e.g. "optimize with `auto_apply` off":
| Parameter | Default | Description |
|---|---|---|
| `auto_apply` | `true` | `false` = pause for approval before each change |
| `target_pass_rate` | `0.9` | Stop when reached |
| `max_iterations` | `10` | Hard stop |
| `stall_limit` | `3` | Stop after N consecutive no-improvement rounds |
| `runs_per_case` | `1` | Set to `3` for flaky skills (uses majority vote) |
| `partial_eval` | `true` | `false` = always run full eval (more accurate, more tokens) |
| `max_cases` | `7` | Warn if user provides more |
After a run:
```
{name}-workspace/
├── analysis_report.md        Phase 0 diagnostic (if run)
├── analysis_report.json
├── optimization_report.md    Full history with token costs
├── changelog.json            Load this in the eval viewer
└── iter-{N}/
    ├── diff.patch            Change applied this iteration
    ├── token_log.json        Per-iteration cost breakdown
    └── eval-{id}/
        ├── output.md         Skill output for this test case
        └── grading.json      Checklist results
```
Works with Claude Code, OpenCode, or any AI coding assistant that can read markdown files and execute TypeScript.
MIT